4,216 research outputs found

    Dimensionality reduction of clustered data sets

    Get PDF
    We present a novel probabilistic latent variable model to perform linear dimensionality reduction on data sets which contain clusters. We prove that the maximum likelihood solution of the model is an unsupervised generalisation of linear discriminant analysis. This provides a completely new approach to one of the most established and widely used classification algorithms. The performance of the model is then demonstrated on a number of real and artificial data sets

    Machine Learning in Automated Text Categorization

    Full text link
    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey

    Understanding Health and Disease with Multidimensional Single-Cell Methods

    Full text link
    Current efforts in the biomedical sciences and related interdisciplinary fields are focused on gaining a molecular understanding of health and disease, which is a problem of daunting complexity that spans many orders of magnitude in characteristic length scales, from small molecules that regulate cell function to cell ensembles that form tissues and organs working together as an organism. In order to uncover the molecular nature of the emergent properties of a cell, it is essential to measure multiple cell components simultaneously in the same cell. In turn, cell heterogeneity requires multiple cells to be measured in order to understand health and disease in the organism. This review summarizes current efforts towards a data-driven framework that leverages single-cell technologies to build robust signatures of healthy and diseased phenotypes. While some approaches focus on multicolor flow cytometry data and other methods are designed to analyze high-content image-based screens, we emphasize the so-called Supercell/SVM paradigm (recently developed by the authors of this review and collaborators) as a unified framework that captures mesoscopic-scale emergence to build reliable phenotypes. Beyond their specific contributions to basic and translational biomedical research, these efforts illustrate, from a larger perspective, the powerful synergy that might be achieved from bringing together methods and ideas from statistical physics, data mining, and mathematics to solve the most pressing problems currently facing the life sciences.Comment: 25 pages, 7 figures; revised version with minor changes. To appear in J. Phys.: Cond. Mat

    Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis

    Get PDF
    AbstractThe advent of high throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central for statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. It is very unfortunate that as fascinating and classic an area of statistics as model selection, with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we make the following contributions: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) reductions showing that all of the known methods in this class are an instance of the proposed paradigm; (C) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant is a given clustering solution; (D) a general algorithmic paradigm that describes heuristic and very effective speed-ups known in the literature for stability-based model selection methods.Since the performance evaluation of model selection algorithms is mainly experimental, we offer, for completeness and without even attempting to be exhaustive, a representative synopsis of known experimental benchmarking results that highlight the ability of stability-based methods for model selection and the computational resources that they require for the task. As a whole, the contributions of this paper generalize in several respects reference methodologies in statistics and show that algorithmic approaches can yield deep methodological insights into this area, in addition to practical computational procedures

    Improving Efficiency in Deep Learning for Large Scale Visual Recognition

    Get PDF
    The emerging recent large scale visual recognition methods, and in particular the deep Convolutional Neural Networks (CNN), are promising to revolutionize many computer vision based artificial intelligent applications, such as autonomous driving and online image retrieval systems. One of the main challenges in large scale visual recognition is the complexity of the corresponding algorithms. This is further exacerbated by the fact that in most real-world scenarios they need to run in real time and on platforms that have limited computational resources. This dissertation focuses on improving the efficiency of such large scale visual recognition algorithms from several perspectives. First, to reduce the complexity of large scale classification to sub-linear with the number of classes, a probabilistic label tree framework is proposed. A test sample is classified by traversing the label tree from the root node. Each node in the tree is associated with a probabilistic estimation of all the labels. The tree is learned recursively with iterative maximum likelihood optimization. Comparing to the hard label partition proposed previously, the probabilistic framework performs classification more accurately with similar efficiency. Second, we explore the redundancy of parameters in Convolutional Neural Networks (CNN) and employ sparse decomposition to significantly reduce both the amount of parameters and computational complexity. Both inter-channel and inner-channel redundancy is exploit to achieve more than 90\% sparsity with approximately 1\% drop of classification accuracy. We also propose a CPU based efficient sparse matrix multiplication algorithm to reduce the actual running time of CNN models with sparse convolutional kernels. Third, we propose a multi-stage framework based on CNN to achieve better efficiency than a single traditional CNN model. With a combination of cascade model and the label tree framework, the proposed method divides the input images in both the image space and the label space, and processes each image with CNN models that are most suitable and efficient. The average complexity of the framework is significantly reduced, while the overall accuracy remains the same as in the single complex model
    • …
    corecore