Graph Optimal Transport with Transition Couplings of Random Walks
We present a novel approach to optimal transport between graphs from the
perspective of stationary Markov chains. A weighted graph may be associated
with a stationary Markov chain by means of a random walk on the vertex set with
transition distributions depending on the edge weights of the graph. After
drawing this connection, we describe how optimal transport techniques for
stationary Markov chains may be used in order to perform comparison and
alignment of the graphs under study. In particular, we propose the graph
optimal transition coupling problem, referred to as GraphOTC, in which the
Markov chains associated to two given graphs are optimally synchronized to
minimize an expected cost. The joint synchronized chain yields an alignment of
the vertices and edges in the two graphs, and the expected cost of the
synchronized chain acts as a measure of distance or dissimilarity between the
two graphs. We demonstrate that GraphOTC performs as well as or better than
existing state-of-the-art techniques in graph optimal transport on several
tasks and datasets. Finally, we describe a generalization of the GraphOTC
problem, called the FusedOTC problem, from which we recover the GraphOTC and OT
costs as special cases.
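The abstract's starting point is the association between a weighted graph and a stationary Markov chain via a random walk whose transition probabilities follow the edge weights. A minimal sketch of that construction, using a toy adjacency matrix (the paper's actual graphs, cost function, and coupling algorithm are not reproduced here):

```python
import numpy as np

# Toy weighted graph on three vertices; W[i, j] is the edge weight.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

# Random-walk transition matrix: P[i, j] proportional to W[i, j],
# so each row is a probability distribution over neighbors.
P = W / W.sum(axis=1, keepdims=True)

# For a reversible random walk on a weighted graph, the stationary
# distribution is proportional to the weighted vertex degrees.
pi = W.sum(axis=1) / W.sum()

assert np.allclose(P.sum(axis=1), 1.0)  # rows are distributions
assert np.allclose(pi @ P, pi)          # pi is stationary for P
```

The GraphOTC problem then couples two such chains to minimize an expected alignment cost; the sketch above only covers the graph-to-chain step.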
Novel Algorithms and Datamining for Clustering Massive Datasets
Clustering proteomics data is a challenging problem for any traditional clustering algorithm: the number of samples is usually much smaller than the number of protein peaks, so a clustering algorithm that does not depend on the number of feature variables (here, the number of peaks) is needed. An innovative hierarchical clustering algorithm may be a good approach. This work proposes a new dissimilarity measure for hierarchical clustering, combined with functional data analysis (FDA), and presents a specific application of FDA to a high-throughput proteomics study. The performance of the proposed algorithm is compared to two popular dissimilarity measures in the clustering of samples from normal and Human T Cell Leukemia Virus Type 1 (HTLV-1)-infected patients.
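The general recipe the abstract describes, sketched under stated assumptions: treat each sample's peak intensities as a function, smooth it, and feed a custom dissimilarity into hierarchical clustering. The smoothing and the dissimilarity below (a moving average followed by L2 distance on the smoothed curves) are illustrative stand-ins, not the paper's actual measure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n_samples, n_peaks = 6, 200                  # few samples, many features
X = rng.normal(size=(n_samples, n_peaks))
X[:3] += np.sin(np.linspace(0, 6, n_peaks))  # two loose groups

def smooth(row, width=9):
    """Functional view of a sample: a simple moving-average smoother."""
    kernel = np.ones(width) / width
    return np.convolve(row, kernel, mode="same")

S = np.array([smooth(r) for r in X])
# Pairwise L2 distance between smoothed curves as the dissimilarity.
D = np.sqrt(((S[:, None, :] - S[None, :, :]) ** 2).sum(-1))

Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Because the dissimilarity is computed between whole smoothed curves, the clustering does not degrade as the number of peaks grows relative to the number of samples, which is the regime the abstract highlights.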
The difficulty in clustering spatial data is that the data are multidimensional and massive. Sometimes an automated clustering algorithm alone may not be sufficient for this type of data; an iterative clustering algorithm with the capability of visual steering may be a good approach. This case study proposes a new iterative algorithm that combines automated clustering methods, such as Bayesian clustering and detection of multivariate outliers, with visual clustering. Simulated data from a plasma experiment and real astronomical data are used to test the performance of the algorithm.
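One stage of the iterative pipeline, the detection of multivariate outliers, can be sketched with a standard Mahalanobis-distance screen; the Bayesian clustering and visual-steering stages are not reproduced, and the cutoff below is a conventional chi-square choice, not one taken from the case study:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 3))
data[0] = [8.0, 8.0, 8.0]          # planted multivariate outlier

mu = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data.T))
# Squared Mahalanobis distance of each point to the sample mean.
d2 = np.einsum("ij,jk,ik->i", data - mu, cov_inv, data - mu)

# Under normality, d2 is roughly chi-square with 3 degrees of freedom;
# 16.27 is approximately its 0.999 quantile.
outliers = np.where(d2 > 16.27)[0]
```

In the iterative scheme, flagged points would be set aside or inspected visually before the automated clustering step is rerun.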
Probability approximations with applications in computational finance and computational biology
In this work, certain probability approximation schemes are applied in two different contexts: stochastic volatility models in financial econometrics, and hierarchical clustering of directional data on the unit (hyper)sphere. In both cases, approximations play an important role in improving computational efficiency. In the first part, we study stochastic volatility models. As an indispensable part of Bayesian inference using MCMC, we need to compute the option prices at each iteration. To facilitate this computation, an approximation scheme based on a central limit theorem is proposed for the numerical computation of option prices, and error bounds for the approximations are obtained. The second part of the work originates from studying microarray data. After pre-processing, each gene is represented by a unit vector. To study their patterns, we adopt hierarchical clustering and introduce the idea of linking by the size of a spherical cap, so that each cluster is represented by a spherical cap. By studying the distribution of directional data on the unit (hyper)sphere, we can assess the significance of observing a big cluster using Poisson approximations.
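The spherical-cap idea from the second part can be sketched as follows: each gene becomes a unit vector, and a cluster is summarized by the smallest cap around its normalized mean direction. This is only the geometric representation; the cap-size statistic feeding the Poisson approximation is not reproduced here, and the data below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
genes = rng.normal(size=(20, 5))
# Project each gene's expression profile onto the unit sphere.
U = genes / np.linalg.norm(genes, axis=1, keepdims=True)

def cap(cluster):
    """Center and angular radius of the cap covering a set of unit vectors."""
    center = cluster.mean(axis=0)
    center /= np.linalg.norm(center)
    cosines = np.clip(cluster @ center, -1.0, 1.0)
    radius = np.arccos(cosines.min())  # widest angle from the center
    return center, radius

center, radius = cap(U[:5])
```

Linking clusters by cap size, as the abstract describes, then amounts to merging the pair of clusters whose combined cap has the smallest angular radius.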
On partitioning multivariate self-affine time series
Given a multivariate time series, possibly of high dimension, with unknown and time-varying joint distribution, it is of interest to be able to completely partition the time series into disjoint, contiguous subseries, each of which has different distributional or pattern attributes from the preceding and succeeding subseries. An additional feature of many time series is that they display self-affinity, so that subseries at one time scale are similar to subseries at another after application of an affine transformation. Such qualities are observed in time series from many disciplines, including biology, medicine, economics, finance, and computer science. This paper defines the relevant multiobjective combinatorial optimization problem, under limited assumptions, as a biobjective one, and presents a specialized evolutionary algorithm that finds optimal self-affine time series partitionings with a minimum of choice parameters. The algorithm not only finds partitionings for all possible numbers of partitions given data constraints, but also for self-affinities between these partitionings and some fine-grained partitioning. The resulting set of Pareto-efficient solution sets provides a rich representation of the self-affine properties of a multivariate time series at different locations and time scales.
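The encoding the abstract implies can be sketched simply: a candidate partitioning is a sorted list of breakpoints, and each contiguous subseries between breakpoints is scored for distributional homogeneity. The Gaussian segmentation cost below is an illustrative single objective, not the paper's biobjective fitness, and the series is a synthetic univariate stand-in:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic series whose distribution changes (variance jump) at index 100.
series = np.concatenate([rng.normal(0, 1, 100), rng.normal(0, 4, 100)])

def segments(x, breakpoints):
    """Split x into disjoint contiguous subseries at the given breakpoints."""
    bounds = [0, *sorted(breakpoints), len(x)]
    return [x[a:b] for a, b in zip(bounds, bounds[1:])]

def score(x, breakpoints):
    """Gaussian segmentation cost: sum of n_i * log(variance of segment i).
    Lower is better; homogeneous segments give smaller variances."""
    return sum(len(seg) * np.log(seg.var()) for seg in segments(x, breakpoints))

good = score(series, [100])  # breakpoint at the true change point
bad = score(series, [10])    # breakpoint far from the change point
```

An evolutionary algorithm over breakpoint lists would mutate and recombine such candidates, keeping the Pareto front over the two objectives the paper defines.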
Measuring integrated information: comparison of candidate measures in theory and simulation
Integrated Information Theory (IIT) is a prominent theory of consciousness that has at its centre measures that quantify the extent to which a system generates more information than the sum of its parts. While several candidate measures of integrated information (‘Φ’) now exist, little is known about how they compare, especially in terms of their behaviour on non-trivial network models. In this article we provide clear and intuitive descriptions of six distinct candidate measures. We then explore the properties of each of these measures in simulation on networks consisting of eight interacting nodes, animated with Gaussian linear autoregressive dynamics. We find a striking diversity in the behaviour of these measures: no two measures show consistent agreement across all analyses. Further, only a subset of the measures appear to genuinely reflect some form of dynamical complexity, in the sense of simultaneous segregation and integration between system components. Our results help guide the operationalisation of IIT and advance the development of measures of integrated information that may have more general applicability.
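The simulation setup the article describes can be sketched as an eight-node network with Gaussian linear autoregressive dynamics, x[t+1] = A x[t] + noise. The coupling matrix and noise scale below are arbitrary stable choices, and none of the candidate Φ measures themselves are implemented; the sketch only produces the covariance statistics on which Gaussian measures of this kind are built:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
A = rng.normal(scale=0.2, size=(n, n))  # random pairwise couplings
# Rescale so the spectral radius is below 1, guaranteeing a stationary process.
A /= max(1.0, 1.1 * np.max(np.abs(np.linalg.eigvals(A))))

T = 2000
X = np.zeros((T, n))
for t in range(T - 1):
    X[t + 1] = A @ X[t] + rng.normal(scale=1.0, size=n)

# Same-time and lag-one covariances: the basic ingredients of
# Gaussian information-theoretic measures on such networks.
Sigma = np.cov(X[:-1].T)
Sigma_lag = (X[1:].T @ X[:-1]) / (T - 1)
```

Candidate measures would then compare the information carried by the whole system's dynamics, via these covariances, against that carried by its parts under some partition.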