
    Stochastic Data Clustering

    In 1961 Herbert Simon and Albert Ando published the theory behind the long-term behavior of a dynamical system that can be described by a nearly uncoupled matrix. Over the past fifty years this theory has been used in a variety of contexts, including queueing theory, brain organization, and ecology. In all these applications, the structure of the system is known and the point of interest is the various stages the system passes through on its way to some long-term equilibrium. This paper looks at the problem from the other direction: we develop a technique for using the evolution of the system to tell us about its initial structure, and we use this technique to develop a new algorithm for data clustering. Comment: 23 pages
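The idea of reading structure off a system's evolution can be illustrated with a small sketch. It is not the paper's algorithm: the matrix `P`, the starting vector, the step count, and the `split_at_largest_gap` helper are all assumptions chosen for illustration. In a nearly uncoupled chain, entries inside a block equalize quickly while the blocks themselves mix slowly, so the intermediate state reveals the blocks.

```python
# Hypothetical 4-state chain with two weakly coupled blocks: {0, 1} and {2, 3}.
P = [
    [0.69, 0.29, 0.01, 0.01],
    [0.29, 0.69, 0.01, 0.01],
    [0.01, 0.01, 0.69, 0.29],
    [0.01, 0.01, 0.29, 0.69],
]

def step(x, P):
    """One step of the evolution x -> x P (row vector times matrix)."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

def evolve(x, P, steps):
    """Apply `step` repeatedly and return the resulting state."""
    for _ in range(steps):
        x = step(x, P)
    return x

def split_at_largest_gap(x):
    """Group state indices by cutting the sorted values at the widest gap."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    gaps = [x[order[k + 1]] - x[order[k]] for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return sorted([sorted(order[:cut]), sorted(order[cut:])])

# Start with all mass on state 0 and stop well before global equilibrium.
x10 = evolve([1.0, 0.0, 0.0, 0.0], P, steps=10)
clusters = split_at_largest_gap(x10)
print(clusters)
```

After ten steps the two entries of each block are nearly equal, while the blocks still differ clearly, so the largest gap separates them.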

    Info-Clustering: A Mathematical Theory for Data Clustering

    We formulate an info-clustering paradigm based on a multivariate information measure, called multivariate mutual information, that naturally extends Shannon's mutual information between two random variables to the case of more than two random variables. With proper model reductions, we show that the paradigm can be applied to study the human genome and connectome in a more meaningful way than the conventional algorithmic approach. Not only can info-clustering provide justifications and refinements to some existing techniques, but it also inspires new computationally feasible solutions. Comment: In celebration of Claude Shannon's Centenary
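For orientation, the two-variable quantity that the paradigm generalizes can be computed directly from a joint distribution. The sketch below covers only Shannon's mutual information I(X;Y) for two discrete variables, not the multivariate measure from the paper. The function name and the example distributions are assumptions made for illustration.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as a 2-D list of probabilities."""
    px = [sum(row) for row in joint]            # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # 0 * log(0) is taken as 0
                total += p * log2(p / (px[i] * py[j]))
    return total

independent = [[0.25, 0.25], [0.25, 0.25]]      # X and Y are independent fair bits
identical = [[0.5, 0.0], [0.0, 0.5]]            # Y is a copy of X

print(mutual_information(independent))  # 0.0
print(mutual_information(identical))    # 1.0
```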

    Spectral Clustering with Imbalanced Data

    Spectral clustering is sensitive to how graphs are constructed from data, particularly when proximal and imbalanced clusters are present. We show that Ratio-Cut (RCut) or normalized cut (NCut) objectives are not tailored to imbalanced data since they tend to emphasize cut sizes over cut values. We propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on partitions to deal with imbalanced data. Our approach parameterizes a family of graphs, by adaptively modulating node degrees on a fixed node set, to yield a set of parameter-dependent cuts reflecting varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach. We demonstrate the superiority of our method through unsupervised and semi-supervised experiments on synthetic and real data sets. Comment: 24 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1302.513
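The claim that RCut and NCut favor balanced partition sizes over small cut values can be checked on a toy graph. The sketch below uses the standard two-way definitions, RCut = cut(A,B)(1/|A| + 1/|B|) and NCut = cut(A,B)(1/vol(A) + 1/vol(B)). The graph itself (a 2-node cluster attached to a 6-node path by a weight-0.8 edge) and the function name are assumptions for illustration, not data from the paper.

```python
def cut_objectives(W, A):
    """Return (cut, RCut, NCut) for the partition (A, complement of A).

    W is a symmetric weight matrix as nested lists; A is a set of node indices.
    """
    n = len(W)
    B = set(range(n)) - set(A)
    cut = sum(W[i][j] for i in A for j in B)
    vol_A = sum(W[i][j] for i in A for j in range(n))   # total degree of A
    vol_B = sum(W[i][j] for i in B for j in range(n))   # total degree of B
    rcut = cut * (1 / len(A) + 1 / len(B))
    ncut = cut * (1 / vol_A + 1 / vol_B)
    return cut, rcut, ncut

# Hypothetical graph: small cluster {0, 1}, weak bridge 1-2, path 2-3-...-7.
edges = [(0, 1, 1.0), (1, 2, 0.8)] + [(k, k + 1, 1.0) for k in range(2, 7)]
W = [[0.0] * 8 for _ in range(8)]
for i, j, w in edges:
    W[i][j] = W[j][i] = w

small = cut_objectives(W, {0, 1})            # splits off the small cluster
balanced = cut_objectives(W, {0, 1, 2, 3})   # equal-size halves
print("small cluster:", small)
print("balanced:     ", balanced)
```

On this graph the split that isolates the small cluster has the smaller cut value, yet both RCut and NCut score the balanced split lower, which is the behavior the abstract describes.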

    Analyzing and clustering neural data

    This thesis aims to analyze neural data in an overall effort by the Charles Stark Draper Laboratory to determine an underlying pattern in brain activity in healthy individuals versus patients with a brain degenerative disorder. The neural data comes from ECoG (electrocorticography) applied to either humans or primates. Each ECoG array has electrodes that measure voltage variations, which neuroscientists claim correlate with neurons transmitting signals to one another. ECoG differs from the less invasive technique of EEG (electroencephalography) in that EEG electrodes are placed above a patient's scalp, while ECoG involves drilling small holes in the skull to allow electrodes to be closer to the brain. Because of this, ECoG boasts an exceptionally high signal-to-noise ratio and less susceptibility to artifacts than EEG [6]. While wearing the ECoG caps, the patients are asked to perform a range of different tasks. The tasks performed by patients are partitioned into different levels of mental stress, i.e., how much concentration is presumably required. The specific dataset used in this thesis is derived from cognitive behavior experiments performed on primates at MGH (Massachusetts General Hospital). The content of this thesis can be thought of as a pipelined process. First the data is collected from the ECoG electrodes, then the data is pre-processed via signal processing techniques, and finally the data is clustered via unsupervised learning techniques. For both the pre-processing and the clustering steps, different techniques are applied and then compared against one another. The focus of this thesis is to evaluate clustering techniques when applied to neural data. For the pre-processing step, two types of bandpass filter, a Butterworth filter and a Chebyshev filter, were applied. For the clustering step, three techniques were applied to the data: K-means clustering, spectral clustering, and self-tuning spectral clustering.
    We conclude that for pre-processing the results from both filters are very similar and thus either filter is sufficient. For clustering we conclude that K-means has the lowest amount of overlap between clusters. K-means is also the most time-efficient of the three techniques and is thus the ideal choice for this application. Date: 2016-10-27
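The clustering step can be sketched with a minimal K-means (Lloyd's algorithm) in plain Python. This is not the thesis code: the `kmeans` function, the fixed initial centers, and the six 2-D points standing in for pre-processed feature vectors are assumptions for illustration, and no ECoG data or filtering is involved.

```python
def kmeans(points, centers, iterations=20):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:
            break                               # assignments have stabilized
        centers = new_centers
    return labels, centers

# Hypothetical feature vectors forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

Each iteration costs one pass over the points per center, which is consistent with K-means being the cheapest of the three techniques compared.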

    Factor PD-Clustering

    Factorial clustering methods have been developed in recent years thanks to improvements in computational power. These methods perform a linear transformation of the data and a clustering of the transformed data, optimizing a common criterion. Factor PD-clustering is based on Probabilistic Distance clustering (PD-clustering), an iterative, distribution-free, probabilistic clustering method. Factor PD-clustering applies a linear transformation of the original variables into a reduced number of orthogonal ones, using a criterion shared with PD-clustering. It is demonstrated that the Tucker 3 decomposition makes it possible to obtain this transformation. Factor PD-clustering alternates between a Tucker 3 decomposition and a PD-clustering of the transformed data until convergence. This method could significantly improve the performance of the algorithm, makes it possible to work with large datasets, and improves the stability and robustness of the method.
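The PD-clustering step alone (without the Tucker 3 transformation) can be sketched as follows. The abstract does not state the update rules, so the ones used here are assumptions based on the usual formulation of probabilistic distance clustering: membership probabilities inversely proportional to the distance to each center, and centers updated as averages weighted by p²/d. The function name, starting centers, and data are hypothetical.

```python
from math import dist

def pd_clustering(points, centers, iterations=50, eps=1e-12):
    """Alternate membership probabilities and center updates (assumed PD-clustering rules)."""
    n_clusters, n_dims = len(centers), len(points[0])
    for _ in range(iterations):
        probs, weights = [], []
        for x in points:
            d = [max(dist(x, c), eps) for c in centers]   # eps avoids division by zero
            inv = [1.0 / dk for dk in d]
            p = [v / sum(inv) for v in inv]               # p_k * d_k is the same for every k
            probs.append(p)
            weights.append([p[k] ** 2 / d[k] for k in range(n_clusters)])
        # Each center becomes the weighted average of all points.
        centers = [
            tuple(
                sum(w[k] * x[j] for w, x in zip(weights, points))
                / sum(w[k] for w in weights)
                for j in range(n_dims)
            )
            for k in range(n_clusters)
        ]
    return centers, probs   # probs come from the final pass

# Hypothetical data: two well-separated groups; starting centers are off the data.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, probs = pd_clustering(points, [(1.0, 1.0), (4.0, 4.0)])
labels = [p.index(max(p)) for p in probs]
print(labels)
```

No distributional model is fitted at any point, which is what "distribution-free" refers to: only distances to the current centers enter the updates.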

    Co-clustering separately exchangeable network data

    This article establishes the performance of stochastic blockmodels in addressing the co-clustering problem of partitioning a binary array into subsets, assuming only that the data are generated by a nonparametric process satisfying the condition of separate exchangeability. We provide oracle inequalities with rate of convergence $\mathcal{O}_P(n^{-1/4})$ corresponding to profile likelihood maximization and mean-square error minimization, and show that the blockmodel can be interpreted in this setting as an optimal piecewise-constant approximation to the generative nonparametric model. We also show for large sample sizes that the detection of co-clusters in such data indicates with high probability the existence of co-clusters of equal size and asymptotically equivalent connectivity in the underlying generative process. Comment: Published in at http://dx.doi.org/10.1214/13-AOS1173 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org
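The piecewise-constant view of the blockmodel can be made concrete: for a fixed assignment of rows and columns to clusters, the block averages are the constants that minimize mean-square error. The sketch below shows only that fitting step, not the search over partitions or the paper's analysis. The array `A`, the label vectors, and the function names are assumptions for illustration.

```python
def block_means(A, row_labels, col_labels):
    """Average of binary array A within each (row cluster, column cluster) block."""
    sums, counts = {}, {}
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] = sums.get(key, 0) + a
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

def mean_square_error(A, row_labels, col_labels):
    """MSE between A and its piecewise-constant fit for the given co-clustering."""
    means = block_means(A, row_labels, col_labels)
    n_entries = sum(len(row) for row in A)
    return sum(
        (a - means[(row_labels[i], col_labels[j])]) ** 2
        for i, row in enumerate(A)
        for j, a in enumerate(row)
    ) / n_entries

# Hypothetical 4x4 binary array with two row clusters and two column clusters.
A = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
rows, cols = [0, 0, 1, 1], [0, 0, 1, 1]
means = block_means(A, rows, cols)
print(means)
print(mean_square_error(A, rows, cols))
```

Comparing `mean_square_error` across candidate label vectors gives a simple way to rank co-clusterings under the mean-square criterion.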