944 research outputs found

    Data Clustering And Visualization Through Matrix Factorization

    Clustering is traditionally an unsupervised task: finding natural groupings, or clusters, in multidimensional data based on perceived similarities among the patterns. The purpose of clustering is to extract useful information from unlabeled data. To present the knowledge extracted by clustering in a meaningful way, data visualization has become a popular and growing area of research. Visualization can provide a qualitative overview of large and complex data sets, which helps us gain the desired insight into the phenomena of interest in the data. The contribution of this dissertation is two-fold: Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for data clustering/co-clustering and Exemplar-based data Visualization (EV) through matrix factorization. Compared to traditional data mining models, matrix-based methods are fast, easy to understand and implement, and especially suitable for solving large-scale challenging problems in text mining, image grouping, medical diagnosis, and bioinformatics. In this dissertation, we present two effective matrix-based solutions in the new directions of data clustering and visualization. First, in many practical learning domains, there is a large supply of unlabeled data but limited labeled data, and in most cases it might be expensive to generate large amounts of labeled data. Traditional clustering algorithms completely ignore these valuable labeled data and thus are inapplicable to these problems. Consequently, semi-supervised clustering, which can incorporate domain knowledge to guide a clustering algorithm, has become a topic of significant recent interest. Thus, we develop a Non-negative Matrix Factorization (NMF) based framework to incorporate prior knowledge into data clustering. 
Moreover, with the fast growth of the Internet and computational technologies in the past decade, many data mining applications have advanced swiftly from the simple clustering of one data type to the co-clustering of multiple data types, usually involving high heterogeneity. To this end, we extend SS-NMF to perform heterogeneous data co-clustering. From a theoretical perspective, SS-NMF for data clustering/co-clustering is mathematically rigorous. The convergence and correctness of our algorithms are proved. In addition, we discuss the relationship between SS-NMF and other well-known clustering and co-clustering models. Second, most current clustering models only provide the centroids (e.g., mathematical means of the clusters) without inferring representative exemplars from the real data, and thus they cannot summarize or visualize the raw data well. A new method, Exemplar-based Visualization (EV), is proposed to cluster and visualize extremely large-scale data. Capitalizing on recent advances in matrix approximation and factorization, EV provides a means to visualize large-scale data with high accuracy (in retaining neighbor relations), high efficiency (in computation), and high flexibility (through the use of exemplars). Empirically, we demonstrate the superior performance of our matrix-based data clustering and visualization models through extensive experiments performed on publicly available large-scale data sets.
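To make the idea of clustering through matrix factorization concrete, the sketch below factors a small non-negative matrix X into W and H with the standard Lee-Seung multiplicative updates and reads a cluster label for each row from the largest entry of its row in W. This is plain unsupervised NMF, not the SS-NMF framework of the dissertation: the semi-supervised constraints and the co-clustering extension are not reproduced here, and the function names, the toy matrix, and the iteration count are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(x, k, iters=500, seed=0, eps=1e-9):
    """Approximate non-negative x (n x m) as w (n x k) times h (k x m).

    Uses Lee-Seung multiplicative updates for the squared Frobenius error.
    This is a hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def frobenius_error(x, w, h):
    """Frobenius norm of the residual x - w h."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rx, rw in zip(x, wh) for a, b in zip(rx, rw)) ** 0.5


def cluster_labels(w):
    """Assign each row to the factor with the largest weight in w."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]


# Toy data (assumed): rows 0-1 use the first two features, rows 2-3 the last two.
X = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

if __name__ == "__main__":
    W, H = nmf(X, k=2)
    print(cluster_labels(W), frobenius_error(X, W, H))
```

Which factor index ends up attached to which group depends on the random initialization, so only the grouping itself is meaningful, not the label values.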

    Organising a photograph collection based on human appearance

    This thesis describes a complete framework for organising digital photographs in an unsupervised manner, based on the appearance of people captured in the photographs. Organising a collection of photographs manually, especially providing the identities of people captured in photographs, is a time-consuming task. Unsupervised grouping of images containing similar persons makes annotating names easier (as a group of images can be named at once) and enables quick search based on query by example. The full process of unsupervised clustering is discussed in this thesis. Methods for locating facial components are discussed, and a technique based on colour image segmentation is proposed and tested. A method based on a Principal Component Analysis template is also tested. These provide the eye locations required for acquiring a normalised facial image. This image is then preprocessed by histogram equalisation and feathering, and the features of the MPEG-7 face recognition descriptor are extracted. A distance measure proposed in the MPEG-7 standard is used as a similarity measure. Three approaches to grouping that use only face recognition features for clustering are analysed. These are modified k-means, single-link and a method based on a nearest neighbour classifier. The nearest neighbour-based technique is chosen for further experiments with fusing information from several sources. These sources are context-based, such as events (party, trip, holidays) and the ownership of photographs, and content-based, such as information about the colour and texture of the bodies of humans appearing in photographs. Two techniques are proposed for fusing event and ownership (user) information with the face recognition features: a Transferable Belief Model (TBM) and three level clustering. The three level clustering is carried out at “event” level, “user” level and “collection” level. The latter technique proves to be most efficient. 
For combining body information with the face recognition features, three probabilistic fusion methods are tested. These are the average sum, the generalised product and the maximum rule. Combinations are tested within events and within user collections. This work concludes with a brief discussion on the extraction of key images to represent each cluster.
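As an illustration of the probabilistic fusion step, the sketch below combines per-identity scores from two sources with the classical sum (average), product and maximum rules. It is a minimal sketch under stated assumptions: the score vectors are made up, the function name `fuse` is hypothetical, and the plain product rule shown here may differ from the "generalised product" used in the thesis.

```python
from math import prod


def fuse(scores, rule):
    """Fuse class-probability vectors from several sources.

    scores: one list of per-class probabilities per source (equal lengths).
    rule:   "sum" (average), "product", or "max".
    Returns a vector renormalised to sum to 1.
    """
    per_class = list(zip(*scores))  # group the sources' scores by class
    if rule == "sum":
        fused = [sum(c) / len(c) for c in per_class]
    elif rule == "product":
        fused = [prod(c) for c in per_class]
    elif rule == "max":
        fused = [max(c) for c in per_class]
    else:
        raise ValueError(f"unknown rule: {rule}")
    total = sum(fused)
    return [f / total for f in fused]


# Assumed example scores for three candidate identities.
face_scores = [0.7, 0.2, 0.1]   # from face recognition features
body_scores = [0.3, 0.6, 0.1]   # from body colour/texture features

if __name__ == "__main__":
    for rule in ("sum", "product", "max"):
        print(rule, fuse([face_scores, body_scores], rule))
```

The product rule penalises any identity that one source considers unlikely, while the maximum rule trusts whichever source is most confident; the average sits between the two.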

    Advanced Probabilistic Models for Clustering and Projection

    Probabilistic modeling for data mining and machine learning problems is a fundamental research area. The general approach is to assume a generative model underlying the observed data, and to estimate the model parameters via likelihood maximization. It rests on probability theory as its mathematical background, and draws on a large number of methods from statistical learning, sampling theory and Bayesian statistics. In this thesis we study several advanced probabilistic models for data clustering and feature projection, which are two important unsupervised learning problems. The goal of clustering is to group similar data points together to uncover the data clusters. While numerous methods exist for various clustering tasks, one important question still remains: how to automatically determine the number of clusters. The first part of the thesis answers this question from a mixture modeling perspective. A finite mixture model is first introduced for clustering, in which each mixture component is assumed to be an exponential family distribution for generality. The model is then extended to an infinite mixture model, and its strong connection to the Dirichlet process (DP), a non-parametric Bayesian framework, is uncovered. A variational Bayesian algorithm called VBDMA is derived from this new insight to learn the number of clusters automatically, and empirical studies on some 2D data sets and an image data set verify the effectiveness of this algorithm. In feature projection, we are interested in dimensionality reduction and aim to find a low-dimensional feature representation for the data. We first review the well-known principal component analysis (PCA) and its probabilistic interpretation (PPCA), and then generalize PPCA to a novel probabilistic model which is able to handle the non-linear projection known as kernel PCA. An expectation-maximization (EM) algorithm is derived for kernel PCA such that it is fast and applicable to large data sets. 
Then we propose a novel supervised projection method called MORP, which can take the output information into account in a supervised learning context. Empirical studies on various data sets show much better results compared to unsupervised projection and other supervised projection methods. Finally, we generalize MORP probabilistically to propose SPPCA for supervised projection, and we can also naturally extend the model to S2PPCA, which is a semi-supervised projection method. This allows us to incorporate both the label information and the unlabeled data into the projection process. In the third part of the thesis, we introduce a unified probabilistic model which can handle data clustering and feature projection jointly. The model can be viewed as a clustering model with projected features, and a projection model with structured documents. A variational Bayesian learning algorithm can be derived, and it turns out to iterate the clustering operations and projection operations until convergence. Superior performance can be obtained for both clustering and projection.
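The feature projection problem reviewed above starts from ordinary PCA. The sketch below shows only that baseline: it finds the first principal direction of a small data set by power iteration on the sample covariance matrix and projects the points onto it. It does not implement PPCA, kernel PCA, MORP or SPPCA; the function names and the toy data are assumptions made for illustration.

```python
def first_principal_component(data, iters=100):
    """Return (mean, direction) of the leading principal component.

    Uses power iteration on the sample covariance matrix. The all-ones
    start vector is an assumption; it fails if it happens to be
    orthogonal to the leading direction.
    """
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        # Multiply by the covariance matrix, then renormalise to unit length.
        v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return mean, v


def project(data, mean, v):
    """One-dimensional coordinates of each point along direction v."""
    return [sum((row[j] - mean[j]) * v[j] for j in range(len(v))) for row in data]


# Assumed toy data lying exactly on the line y = 2x.
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]

if __name__ == "__main__":
    mean, direction = first_principal_component(points)
    print(direction, project(points, mean, direction))
```

Because the toy points lie on a single line, one component captures all of the variance, and the projected coordinates preserve the ordering of the points along that line.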

    Improving Clustering Methods By Exploiting Richness Of Text Data

    Clustering is an unsupervised machine learning technique that involves discovering different clusters (groups) of similar objects in unlabeled data, and it is generally considered to be an NP-hard problem. Clustering methods are widely used in a variety of disciplines for analyzing different types of data, and a small improvement in a clustering method can cause a ripple effect in advancing research in multiple fields. Clustering any type of data is challenging and there are many open research questions. The clustering problem is exacerbated in the case of text data because of additional challenges such as capturing the semantics of a document, handling the rich features of text data and dealing with the well-known curse of dimensionality. In this thesis, we investigate the limitations of existing text clustering methods and address these limitations by providing five new text clustering methods: Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC) and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five new clustering methods showed that the use of rich features in text clustering methods could outperform the existing state-of-the-art text clustering methods. The first new text clustering method, QSC, exploits user queries (one of the rich features in text data) to generate better quality clusters and cluster labels. The second text clustering method, DWKM, uses a probability-based weighting scheme to formulate a semantically weighted distance measure to improve the clustering results. The third text clustering method, MMOEA, is based on a multi-objective evolutionary algorithm. MMOEA exploits rich features to generate a diverse set of candidate clustering solutions, and forms a better clustering solution using a cluster-oriented approach. 
The fourth and fifth text clustering methods, MDC and MOMVEC, address the limitations of MMOEA. MDC and MOMVEC differ in the implementation of their multi-objective evolutionary approaches. All five methods are compared with existing state-of-the-art methods. The results of the comparisons show that the newly developed text clustering methods outperform existing methods, achieving up to 16% improvement for some comparisons. In general, almost all newly developed clustering algorithms showed statistically significant improvements over other existing methods. The key ideas of the thesis highlight that exploiting user queries improves Search Result Clustering (SRC); utilizing rich features in weighting schemes and distance measures improves soft subspace clustering; utilizing multiple views and a multi-objective cluster-oriented method improves clustering ensemble methods; and better evolutionary operators and objective functions improve multi-objective evolutionary clustering ensemble methods. The new text clustering methods introduced in this thesis can be widely applied in various domains that involve analysis of text data. The contributions of this thesis, which include five new text clustering methods, will help not only researchers in the data mining field but also a wide range of researchers in other fields.