3 research outputs found

    Algorithmic advances in learning from large dimensional matrices and scientific data

    Get PDF
    University of Minnesota Ph.D. dissertation.May 2018. Major: Computer Science. Advisor: Yousef Saad. 1 computer file (PDF); xi, 196 pages.This thesis is devoted to answering a range of questions in machine learning and data analysis related to large dimensional matrices and scientific data. Two key research objectives connect the different parts of the thesis: (a) development of fast, efficient, and scalable algorithms for machine learning which handle large matrices and high dimensional data; and (b) design of learning algorithms for scientific data applications. The work combines ideas from multiple, often non-traditional, fields leading to new algorithms, new theory, and new insights in different applications. The first of the three parts of this thesis explores numerical linear algebra tools to develop efficient algorithms for machine learning with reduced computation cost and improved scalability. Here, we first develop inexpensive algorithms combining various ideas from linear algebra and approximation theory for matrix spectrum related problems such as numerical rank estimation, matrix function trace estimation including log-determinants, Schatten norms, and other spectral sums. We also propose a new method which simultaneously estimates the dimension of the dominant subspace of covariance matrices and obtains an approximation to the subspace. Next, we consider matrix approximation problems such as low rank approximation, column subset selection, and graph sparsification. We present a new approach based on multilevel coarsening to compute these approximations for large sparse matrices and graphs. Lastly, on the linear algebra front, we devise a novel algorithm based on rank shrinkage for the dictionary learning problem, learning a small set of dictionary columns which best represent the given data. The second part of this thesis focuses on exploring novel non-traditional applications of information theory and codes, particularly in solving problems related to machine learning and high dimensional data analysis. Here, we first propose new matrix sketching methods using codes for obtaining low rank approximations of matrices and solving least squares regression problems. Next, we demonstrate that codewords from certain coding scheme perform exceptionally well for the group testing problem. Lastly, we present a novel machine learning application for coding theory, that of solving large scale multilabel classification problems. We propose a new algorithm for multilabel classification which is based on group testing and codes. The algorithm has a simple inexpensive prediction method, and the error correction capabilities of codes are exploited for the first time to correct prediction errors. The third part of the thesis focuses on devising robust and stable learning algorithms, which yield results that are interpretable from specific scientific application viewpoint. We present Union of Intersections (UoI), a flexible, modular, and scalable framework for statistical-machine learning problems. We then adapt this framework to develop new algorithms for matrix decomposition problems such as nonnegative matrix factorization (NMF) and CUR decomposition. We apply these new methods to data from Neuroscience applications in order to obtain insights into the functionality of the brain. Finally, we consider the application of material informatics, learning from materials data. Here, we deploy regression techniques on materials data to predict physical properties of materials

    A Multi-level Approach for Document clustering

    No full text
    Abstract. The divisive MinMaxCut algorithm of Ding et al. [3] produces more accurate clustering results than existing document cluster methods. Multilevel algorithms [4, 1, 5, 7] have been used to boost the speed of graph partitioning algorithms. We combine these two algorithms to construct faster and more accurate algorithm. In this new algorithm, the original graph is coarsened, partitioned by the divisive MinMaxCut algorithm and then decoarsened. A refining algorithm is also applied to improve the accuracy at each level. 1 Introduction Clustering is the task of classifying a collection of objects, such as documents, into natural categories. Diagnostic tasks in medicine involve classifying a set of symptoms according to the cause and treatment methods. Data mining frequently involves separating documents into different categories so as to provide only relevant information for a user's query. We do not expect that classification or clustering algorithms will ever be 100 % accurate, and the related optimization problems are frequently NP-hard. However, we do expect to find algorithms which are highly accurate that are nevertheless very efficient for practical clustering problems. A variety of algorithms have been proposed and are in widespread use, including the K-means method and its variants [9], the Ratio Cut method [4], and the Normalized Cut method [12]. A recent method that has been proposed is based on the computation of eigenvectors which is called the MinMaxCut algorithm proposed by Ding et al. [3]
    corecore