Stochastic Data Clustering
In 1961 Herbert Simon and Albert Ando published the theory behind the
long-term behavior of a dynamical system that can be described by a nearly
uncoupled matrix. Over the past fifty years this theory has been used in a
variety of contexts, including queueing theory, brain organization, and
ecology. In all these applications, the structure of the system is known and
the point of interest is the various stages the system passes through on its
way to some long-term equilibrium.
This paper looks at this problem from the other direction. That is, we
develop a technique for using the evolution of the system to tell us about its
initial structure, and we use this technique to develop a new algorithm for
data clustering.
Comment: 23 pages
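The Simon-Ando picture behind this abstract can be illustrated with a toy example (a sketch, not the paper's algorithm): a nearly uncoupled row-stochastic matrix mixes quickly within its blocks and only slowly between them, so a few steps of the system's evolution already reveal the initial block structure. All numbers below are chosen purely for illustration:

```python
import numpy as np

# A nearly uncoupled (block-dominant) row-stochastic matrix:
# states {0,1} and {2,3} interact strongly within their blocks
# and only weakly across them.
eps = 0.01
P = np.array([
    [0.50 - eps/2, 0.50 - eps/2, eps/2,        eps/2       ],
    [0.50 - eps/2, 0.50 - eps/2, eps/2,        eps/2       ],
    [eps/2,        eps/2,        0.50 - eps/2, 0.50 - eps/2],
    [eps/2,        eps/2,        0.50 - eps/2, 0.50 - eps/2],
])

# After a few steps the chain reaches a short-term (local) equilibrium:
# rows belonging to the same block become nearly identical, while the
# blocks themselves remain distinct for a long time.
Pt = np.linalg.matrix_power(P, 5)

# Group states whose rows of P^t are close -- this recovers the
# initial block structure from the evolution of the system.
labels = [0 if np.linalg.norm(Pt[i] - Pt[0]) < 0.1 else 1 for i in range(4)]
print(labels)  # states 0,1 fall in one cluster, 2,3 in the other
```

Running the evolution longer would wash out the block structure as the chain approaches its global equilibrium, which is why the short-term behavior is the informative regime.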
Info-Clustering: A Mathematical Theory for Data Clustering
We formulate an info-clustering paradigm based on a multivariate information
measure, called multivariate mutual information, that naturally extends
Shannon's mutual information between two random variables to the multivariate
case involving more than two random variables. With proper model reductions, we
show that the paradigm can be applied to study the human genome and connectome
in a more meaningful way than the conventional algorithmic approach. Not only
can info-clustering provide justifications and refinements to some existing
techniques, but it also inspires new computationally feasible solutions.
Comment: In celebration of Claude Shannon's Centenary
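As an illustration of the quantity involved (a sketch, not the authors' model reductions), the multivariate mutual information of a set of discrete variables can be computed by brute force as a minimum over partitions P of the variable set: I(X_V) = min_P (sum_C H(X_C) - H(X_V)) / (|P| - 1). A small Python example for three binary variables, where X1 = X2 are perfectly correlated and X3 is an independent fair coin, so the minimum is attained by splitting X3 off:

```python
import numpy as np

def entropy(table):
    """Shannon entropy (in bits) of a joint probability table."""
    p = np.asarray(table).flatten()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def partitions(items):
    """Generate all set partitions of a list of items."""
    if len(items) == 1:
        yield [items]
        return
    head, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[head] + part[i]] + part[i + 1:]
        yield [[head]] + part

def marginal(joint, axes_keep):
    drop = tuple(a for a in range(joint.ndim) if a not in axes_keep)
    return joint.sum(axis=drop)

def mmi(joint):
    """Multivariate mutual information: minimize over partitions |P| >= 2."""
    V = list(range(joint.ndim))
    best = float("inf")
    for P in partitions(V):
        if len(P) < 2:
            continue
        val = (sum(entropy(marginal(joint, set(C))) for C in P)
               - entropy(joint)) / (len(P) - 1)
        best = min(best, val)
    return best

# X1 = X2 (fair, perfectly correlated), X3 an independent fair coin.
joint = np.zeros((2, 2, 2))
joint[0, 0, :] = 0.25
joint[1, 1, :] = 0.25

print(mmi(joint))  # 0.0 : X3 is independent of the correlated pair (X1, X2)
```

A value of 0 here reflects that the best split peels off the independent variable, which is exactly the behavior an info-clustering threshold would exploit.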
Spectral Clustering with Imbalanced Data
Spectral clustering is sensitive to how graphs are constructed from data,
particularly when proximal and imbalanced clusters are present. We show that
Ratio-Cut (RCut) or normalized cut (NCut) objectives are not tailored to
imbalanced data since they tend to emphasize cut sizes over cut values. We
propose a graph partitioning problem that seeks minimum cut partitions under
minimum size constraints on partitions to deal with imbalanced data. Our
approach parameterizes a family of graphs, by adaptively modulating node
degrees on a fixed node set, to yield a set of parameter dependent cuts
reflecting varying levels of imbalance. The solution to our problem is then
obtained by optimizing over these parameters. We present rigorous limit cut
analysis results to justify our approach. We demonstrate the superiority of our
method through unsupervised and semi-supervised experiments on synthetic and
real data sets.
Comment: 24 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1302.513
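For context, the NCut-style spectral baseline that this abstract critiques can be sketched in a few lines: build a Gaussian-similarity graph, form the normalized Laplacian, and threshold the second-smallest eigenvector. Everything below (1-D data, bandwidth 0.2, cluster sizes 90 vs. 10) is an illustrative assumption, not the paper's size-constrained partitioning method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two imbalanced, well-separated 1-D clusters: 90 points vs. 10 points.
x = np.concatenate([rng.normal(0.0, 0.15, 90), rng.normal(2.0, 0.15, 10)])

# Gaussian-similarity graph (the paper argues the *choice* of graph is
# exactly what makes or breaks the imbalanced case).
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))
np.fill_diagonal(W, 0.0)

# Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}, as used by NCut.
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(x)) - D_inv_sqrt @ W @ D_inv_sqrt

# The second-smallest eigenvector gives a 2-way partition: threshold at 0.
vals, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)
```

On well-separated clusters like these the baseline succeeds; the paper's point is that with proximal, imbalanced clusters the fixed graph construction fails, motivating its family of parameter-dependent graphs.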
Analyzing and clustering neural data
This thesis aims to analyze neural data in an overall effort by the Charles Stark
Draper Laboratory to determine an underlying pattern in brain activity in healthy
individuals versus patients with a brain degenerative disorder. The neural data comes from ECoG (electrocorticography) applied to either humans or primates. Each ECoG array has electrodes that measure voltage variations, which neuroscientists claim correlate with neurons transmitting signals to one another. ECoG differs from the less invasive technique of EEG (electroencephalography) in that EEG electrodes are placed on a patient's scalp, while ECoG involves drilling small holes in the skull to allow electrodes to be closer to the brain. Because of this, ECoG boasts an exceptionally high signal-to-noise ratio and lower susceptibility to artifacts than EEG [6]. While wearing the ECoG caps, the patients are asked to perform a range of different tasks.
The tasks performed by patients are partitioned into different levels of mental stress,
i.e., how much concentration is presumably required. The specific dataset used in
this thesis is derived from cognitive behavior experiments performed on primates at
MGH (Massachusetts General Hospital).
The content of this thesis can be thought of as a pipelined process. First, the
data is collected from the ECoG electrodes; then the data is pre-processed via signal-processing techniques; finally, the data is clustered via unsupervised learning techniques. For both the pre-processing and the clustering steps, different techniques are applied and then compared against one another. The focus of this thesis is to evaluate clustering techniques when applied to neural data.
For the pre-processing step, two types of bandpass filters, a Butterworth filter
and a Chebyshev filter, were applied. For the clustering step, three techniques were
applied to the data: K-means clustering, spectral clustering, and self-tuning spectral clustering. We conclude that for pre-processing the results from both filters are very similar, and thus either filter is sufficient. For clustering, we conclude that K-means has the lowest amount of overlap between clusters. K-means is also the most time-efficient of the three techniques and is thus the ideal choice for this application.
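The filter-then-cluster pipeline described above can be sketched on synthetic data (the actual MGH recordings are not reproduced here; the sampling rate, band edges, and signal frequencies below are illustrative assumptions): bandpass-filter each channel with a Butterworth filter, reduce each channel to a band-power feature, and cluster the features with K-means:

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.cluster import KMeans

fs = 500.0                      # sampling rate in Hz (assumed)
t = np.arange(0, 2.0, 1 / fs)
rng = np.random.default_rng(1)

# Synthetic "channels": half dominated by a 10 Hz rhythm, half by 40 Hz,
# each with additive noise (a stand-in for real ECoG recordings).
channels = np.array(
    [np.sin(2 * np.pi * f * t) + 0.3 * rng.standard_normal(t.size)
     for f in [10] * 5 + [40] * 5]
)

# 4th-order Butterworth bandpass, 8-12 Hz (critical frequencies are
# normalized to the Nyquist frequency fs/2).
b, a = butter(4, [8 / (fs / 2), 12 / (fs / 2)], btype="band")
filtered = filtfilt(b, a, channels, axis=1)

# One feature per channel: mean power remaining after the bandpass.
power = (filtered ** 2).mean(axis=1).reshape(-1, 1)

# Cluster channels by their in-band power.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(power)
```

The 10 Hz channels retain most of their power through the 8-12 Hz band while the 40 Hz channels are strongly attenuated, so K-means separates the two groups cleanly; a Chebyshev filter could be substituted via `scipy.signal.cheby1` with the same interface.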
Factor PD-Clustering
Factorial clustering methods have been developed in recent years thanks to
increasing computational power. These methods perform a linear transformation of
the data and a clustering on the transformed data, optimizing a common criterion.
Factor PD-clustering is based on Probabilistic Distance clustering
(PD-clustering), an iterative, distribution-free, probabilistic clustering
method. Factor PD-clustering makes a linear transformation of the original
variables into a reduced number of orthogonal ones, using a criterion common
with PD-clustering. It is demonstrated that the Tucker3 decomposition allows one
to obtain this transformation. Factor PD-clustering alternates a Tucker3
decomposition and a PD-clustering on the transformed data until convergence.
This method can significantly improve algorithm performance; it allows one to
work with large datasets and improves the stability and robustness of the
method.
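A minimal sketch of plain PD-clustering, the inner step of Factor PD-clustering (the Tucker3 transformation is omitted here), following the usual probabilistic-distance updates in which a point's membership probability p_ik is proportional to the product of its distances to the *other* centers, and centers are updated as weighted means with weights p^2/d. The data and initialization below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two Gaussian blobs in 2-D.
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
               rng.normal([3, 3], 0.3, (50, 2))])

K = 2
centers = X[[0, -1]].copy()   # one seed point from each end of the data
for _ in range(50):
    # Distances from every point to every center (small floor avoids /0).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    # p_ik proportional to the product of the distances to the OTHER centers,
    # so probability rises as the distance to center k falls.
    p = d.prod(axis=1, keepdims=True) / d
    p /= p.sum(axis=1, keepdims=True)
    # Center update: weighted mean with weights p^2 / d.
    w = p ** 2 / d
    centers = (w[:, :, None] * X[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]

labels = p.argmax(axis=1)
```

Factor PD-clustering would wrap this loop around a Tucker3 step on the transformed variables, alternating the two until the shared criterion converges.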
Co-clustering separately exchangeable network data
This article establishes the performance of stochastic blockmodels in
addressing the co-clustering problem of partitioning a binary array into
subsets, assuming only that the data are generated by a nonparametric process
satisfying the condition of separate exchangeability. We provide oracle
inequalities with rate of convergence corresponding
to profile likelihood maximization and mean-square error minimization, and show
that the blockmodel can be interpreted in this setting as an optimal
piecewise-constant approximation to the generative nonparametric model. We also
show for large sample sizes that the detection of co-clusters in such data
indicates with high probability the existence of co-clusters of equal size and
asymptotically equivalent connectivity in the underlying generative process.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/13-AOS1173
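The co-clustering setting can be illustrated by simulating a small stochastic co-blockmodel and fitting it with a simple alternating two-means heuristic. This is an illustrative stand-in, not the paper's profile-likelihood analysis, and all sizes and probabilities below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# A 2x2 stochastic co-blockmodel: (row class, column class) -> edge probability.
B = np.array([[0.9, 0.1],
              [0.1, 0.8]])
rows = np.repeat([0, 1], 30)          # 30 rows per row class
cols = np.repeat([0, 1], 40)          # 40 columns per column class
A = (rng.random((60, 80)) < B[np.ix_(rows, cols)]).astype(int)

def two_means(V, iters=20):
    """Naive 2-means on the rows of V, seeded with the first and last rows."""
    c = V[[0, -1]].astype(float)
    for _ in range(iters):
        lab = np.linalg.norm(V[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([V[lab == k].mean(axis=0) for k in (0, 1)])
    return lab

# Cluster row profiles and column profiles separately.
row_lab = two_means(A)
col_lab = two_means(A.T)

# Block means give the fitted piecewise-constant approximation of the model.
fit = np.array([[A[np.ix_(row_lab == r, col_lab == s)].mean()
                 for s in (0, 1)] for r in (0, 1)])
```

The fitted block means approximate B up to a relabeling of the classes, which mirrors the paper's interpretation of the blockmodel as an optimal piecewise-constant approximation to the generative process.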
