
    Dimensionality Reduction Mappings

    A wealth of powerful dimensionality reduction methods has been established which can be used for data visualization and preprocessing. These are accompanied by formal evaluation schemes, which allow a quantitative evaluation along general principles and which even lead to further visualization schemes based on these objectives. Most methods, however, only provide a mapping of a finite set of points given a priori, requiring additional steps for out-of-sample extensions. We propose a general view on dimensionality reduction based on the concept of cost functions and, based on this general principle, extend dimensionality reduction to explicit mappings of the data manifold. This offers simple out-of-sample extensions. Further, it opens a way towards a theory of data visualization that takes the perspective of its generalization ability to new data points. We demonstrate the approach using a simple global linear mapping as well as prototype-based local linear mappings.
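
To illustrate why an explicit mapping makes out-of-sample extension simple, here is a minimal sketch of a global linear mapping. It is not the cost-function-based training described in the abstract: as a stand-in, the projection direction is fitted by PCA (power iteration on the covariance matrix), and the names `fit_linear_map` and `apply_map` are hypothetical. The point is only that, once the parameters are fitted, a new point is mapped by evaluating the function, with no re-optimization.

```python
import math

def fit_linear_map(points, iters=200):
    """Fit a 1-D linear map x -> w . (x - mean) using the first principal
    direction (assumption: PCA stands in for the cost-based training)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[c] for p in points) / n for c in range(d)]
    centred = [[p[c] - mean[c] for c in range(d)] for p in points]
    cov = [[sum(x[a] * x[b] for x in centred) / n for b in range(d)]
           for a in range(d)]
    # Power iteration; assumes the data are not all identical and that the
    # start vector is not orthogonal to the leading eigenvector.
    w = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return mean, w

def apply_map(mean, w, x):
    """Map any point, seen during fitting or not, to its 1-D coordinate."""
    return sum((x[c] - mean[c]) * w[c] for c in range(len(w)))
```

After `mean, w = fit_linear_map(train_points)`, calling `apply_map(mean, w, new_point)` embeds a point that was not part of the training set.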

    Masking Strategies for Image Manifolds

    We consider the problem of selecting an optimal mask for an image manifold, i.e., choosing a subset of the pixels of the image that preserves the manifold's geometric structure present in the original data. Such masking implements a form of compressive sensing through emerging imaging sensor platforms for which the power expense grows with the number of pixels acquired. Our goal is for the manifold learned from masked images to resemble its full image counterpart as closely as possible. More precisely, we show that one can indeed accurately learn an image manifold without having to consider a large majority of the image pixels. In doing so, we consider two masking methods that preserve the local and global geometric structure of the manifold, respectively. In each case, the process of finding the optimal masking pattern can be cast as a binary integer program, which is computationally expensive but can be approximated by a fast greedy algorithm. Numerical experiments show that the relevant manifold structure is preserved through the data-dependent masking process, even for modest mask sizes.
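
The greedy approximation mentioned above can be sketched as forward selection of pixels. The objective used here is an assumption made for illustration, not the paper's formulation: it picks, one pixel at a time, the pixel that brings the masked squared distances between all image pairs closest to the full squared distances. The function name `greedy_mask` and the flat-list image representation are likewise hypothetical.

```python
from itertools import combinations

def greedy_mask(images, m):
    """Pick m pixel indices so that masked pairwise squared distances track
    the full ones (simplified global criterion, assumed for illustration)."""
    n_pixels = len(images[0])
    pairs = list(combinations(range(len(images)), 2))
    # diffs[q][p]: contribution of pixel p to the squared distance of pair q.
    diffs = [[(images[a][p] - images[b][p]) ** 2 for p in range(n_pixels)]
             for a, b in pairs]
    full = [sum(row) for row in diffs]   # full squared distances
    masked = [0.0] * len(pairs)          # squared distances under the mask
    chosen = []
    for _ in range(m):
        best_pixel, best_err = None, float("inf")
        for p in range(n_pixels):
            if p in chosen:
                continue
            # Squared mismatch over all pairs if pixel p were added.
            err = sum((full[q] - masked[q] - diffs[q][p]) ** 2
                      for q in range(len(pairs)))
            if err < best_err:
                best_pixel, best_err = p, err
        chosen.append(best_pixel)
        for q in range(len(pairs)):
            masked[q] += diffs[q][best_pixel]
    return sorted(chosen)
```

Each step costs time proportional to the number of pixels times the number of image pairs, which is what makes a greedy pass attractive compared with solving the integer program exactly.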

    Data Clustering And Visualization Through Matrix Factorization

    Clustering is traditionally an unsupervised task that seeks natural groupings, or clusters, in multidimensional data based on perceived similarities among the patterns. The purpose of clustering is to extract useful information from unlabeled data. To present the knowledge extracted by clustering in a meaningful way, data visualization has become a popular and growing area of research. Visualization can provide a qualitative overview of large and complex data sets, which helps us gain the desired insight into the phenomena of interest in the data. The contribution of this dissertation is two-fold: Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for data clustering/co-clustering and Exemplar-based data Visualization (EV) through matrix factorization. Compared to traditional data mining models, matrix-based methods are fast, easy to understand and implement, and especially suitable for large-scale challenging problems in text mining, image grouping, medical diagnosis, and bioinformatics. In this dissertation, we present two effective matrix-based solutions in the new directions of data clustering and visualization. First, in many practical learning domains, there is a large supply of unlabeled data but limited labeled data, and in most cases it might be expensive to generate large amounts of labeled data. Traditional clustering algorithms completely ignore these valuable labeled data and thus are inapplicable to these problems. Consequently, semi-supervised clustering, which can incorporate domain knowledge to guide a clustering algorithm, has become a topic of significant recent interest. Thus, we develop a Non-negative Matrix Factorization (NMF) based framework to incorporate prior knowledge into data clustering.
    Moreover, with the fast growth of the Internet and computational technologies in the past decade, many data mining applications have advanced swiftly from the simple clustering of one data type to the co-clustering of multiple data types, usually involving high heterogeneity. To this end, we extend SS-NMF to perform heterogeneous data co-clustering. From a theoretical perspective, SS-NMF for data clustering/co-clustering is mathematically rigorous. The convergence and correctness of our algorithms are proved. In addition, we discuss the relationship between SS-NMF and other well-known clustering and co-clustering models. Second, most current clustering models only provide centroids (e.g., mathematical means of the clusters) without inferring representative exemplars from the real data, and are thus unable to summarize or visualize the raw data well. A new method, Exemplar-based Visualization (EV), is proposed to cluster and visualize extremely large-scale data. Capitalizing on recent advances in matrix approximation and factorization, EV provides a means to visualize large-scale data with high accuracy (in retaining neighbor relations), high efficiency (in computation), and high flexibility (through the use of exemplars). Empirically, we demonstrate the superior performance of our matrix-based data clustering and visualization models through extensive experiments performed on publicly available large-scale data sets.
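
As background for the NMF-based framework, the following is a minimal sketch of plain, unsupervised NMF with the standard multiplicative updates for the squared Frobenius error. It does not include the semi-supervised constraints of SS-NMF or the exemplar selection of EV; the helper names, the row-per-sample layout, and reading a cluster label off the largest entry in each row of `w` are assumptions made for illustration.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix V (one sample per row) as W H,
    with W (n x rank) and H (rank x m) kept non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h

def cluster_labels(w):
    """Assign each sample to the component with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]
```

Because the updates only multiply by non-negative ratios, `w` and `h` stay non-negative throughout, which is what allows the rows of `w` to be read as soft cluster memberships.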

    Effective and Trustworthy Dimensionality Reduction Approaches for High Dimensional Data Understanding and Visualization

    In recent years, the huge expansion of digital technologies has vastly increased the volume of data to be explored. Reducing the dimensionality of data is an essential step in data exploration and visualisation. The integrity of a dimensionality reduction technique relates to how well it maintains the data structure. A visualisation of low-dimensional data that has not captured the structure of the high-dimensional space is untrustworthy. The extent to which a method maintains data structure depends on several factors, such as the type of data considered and the tuning parameters. The type of data includes linear and nonlinear data, and the tuning parameters include the number of neighbours and the perplexity. In reality, most of the data under consideration are nonlinear, and the process of tuning parameters can be costly since it depends on the number of data samples considered. Existing dimensionality reduction approaches suffer from the following problems: 1) they only work well with linear data, 2) the scale of maintained data structure is related to the number of data samples considered, and/or 3) they suffer from the tear problem and the false neighbours problem. To deal with all the above-mentioned problems, this research has developed the Same Degree Distribution (SDD), multi-SDD (MSDD) and parameter-free SDD approaches, which 1) save computational time because their tuning parameter does not depend on the number of data samples, 2) produce more trustworthy visualisations by using a degree distribution that is smooth enough to capture local and global data structure, and 3) do not suffer from the tear and false neighbours problems, because the same degree distribution is used in the high- and low-dimensional spaces to calculate the similarities between data samples. The developed dimensionality reduction methods are tested with several popular synthetic and real datasets.
    The scale of the maintained data structure is evaluated using different quality metrics, i.e., Kendall’s Tau coefficient, Trustworthiness, Continuity, LCMC, and the co-ranking matrix. Also, the theoretical analysis of the impact of the dissimilarity measure on structure capturing is supported by simulation results on two different datasets, evaluated by Kendall’s Tau and the co-ranking matrix. The SDD, MSDD, and parameter-free SDD methods do not outperform other global methods such as Isomap on data with a large fraction of large pairwise distances; addressing this remains future work. Reducing the computational complexity is another objective for future work.
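
Of the quality metrics listed, Trustworthiness is easy to state in code. The sketch below follows the usual rank-based definition: points that appear among the k nearest neighbours of a sample in the embedding but not in the original space are penalised by how far away they were originally. The function names are hypothetical, Euclidean distance is assumed in both spaces, and the brute-force ranking is only suitable for small data sets.

```python
import math

def rank_matrix(points):
    """ranks[i][j] = position of j when the other points are sorted by
    Euclidean distance from i (1 = nearest; ties broken by index)."""
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (math.dist(points[i], points[j]), j))
        for r, j in enumerate(order, start=1):
            ranks[i][j] = r
    return ranks

def trustworthiness(high, low, k):
    """Trustworthiness of the embedding `low` of the data `high` for
    neighbourhood size k; 1.0 means no false neighbours."""
    n = len(high)
    if not 0 < k < n / 2:
        raise ValueError("the normalisation assumes 0 < k < n / 2")
    r_high, r_low = rank_matrix(high), rank_matrix(low)
    penalty = 0
    for i in range(n):
        for j in range(n):
            # j is a neighbour of i in the embedding but not in the data.
            if j != i and r_low[i][j] <= k and r_high[i][j] > k:
                penalty += r_high[i][j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

Under the same definition, Continuity is obtained by swapping the roles of the two spaces, i.e. `trustworthiness(low, high, k)`.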