From random walks to distances on unweighted graphs
Large unweighted directed graphs are commonly used to capture relations
between entities. A fundamental problem in the analysis of such networks is to
properly define the similarity or dissimilarity between any two vertices.
Despite the significance of this problem, statistical characterization of the
proposed metrics has been limited. We introduce and develop a class of
techniques for analyzing random walks on graphs using stochastic calculus.
Using these techniques we generalize results on the degeneracy of hitting times
and analyze a metric based on the Laplace transformed hitting time (LTHT). The
metric serves as a natural, provably well-behaved alternative to the expected
hitting time. We establish a general correspondence between hitting times of
the Brownian motion and analogous hitting times on the graph. We show that the
LTHT is consistent with respect to the underlying metric of a geometric graph,
preserves clustering tendency, and remains robust against random addition of
non-geometric edges. Tests on simulated and real-world data show that the LTHT
matches theoretical predictions and outperforms alternatives. Comment: To appear in NIPS 201
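For intuition, here is a minimal sketch of computing Laplace-transformed hitting times for a simple random walk on a small unweighted graph, by solving the linear system h_u = e^{-beta} * sum_w P_{uw} h_w with h fixed to 1 at the target vertex. The transition matrix, the decay parameter beta, and the final -log transform into a dissimilarity are illustrative choices, not necessarily the paper's exact construction.

```python
import numpy as np

def ltht(A, target, beta=0.1):
    """Laplace-transformed hitting times h[u] = E[exp(-beta * T_{u->target})]
    for a simple random walk on the graph with adjacency matrix A
    (illustrative sketch; P = D^{-1} A, dissimilarity taken as -log h)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)          # random-walk transition matrix
    z = np.exp(-beta)

    others = [u for u in range(n) if u != target]
    # For u != target:  h[u] = z * sum_w P[u, w] * h[w],  with h[target] = 1.
    # Rearranged:       (I - z * P_oo) h_o = z * P_ot
    P_oo = P[np.ix_(others, others)]
    P_ot = P[others, target]
    h_o = np.linalg.solve(np.eye(n - 1) - z * P_oo, z * P_ot)

    h = np.ones(n)
    h[others] = h_o
    return h

# Toy usage: two triangles joined by a single edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
h = ltht(A, target=0, beta=0.2)
dissimilarity = -np.log(h)                         # larger = "farther" from vertex 0
print(np.round(dissimilarity, 3))
```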
Far-Field Compression for Fast Kernel Summation Methods in High Dimensions
We consider fast kernel summations in high dimensions: given a large set of
$N$ points in $d$ dimensions (with $d \gg 1$) and a pair-potential function (the
{\em kernel} function), we compute a weighted sum of all pairwise kernel
interactions for each point in the set. Direct summation is equivalent to a
(dense) matrix-vector multiplication and scales quadratically with the number
of points. Fast kernel summation algorithms reduce this cost to log-linear or
linear complexity.
Treecodes and Fast Multipole Methods (FMMs) deliver tremendous speedups by
constructing approximate representations of interactions of points that are far
from each other. In algebraic terms, these representations correspond to
low-rank approximations of blocks of the overall interaction matrix. Existing
approaches require an excessive number of kernel evaluations with increasing
dimension $d$ and number of points in the dataset.
To address this issue, we use a randomized algebraic approach in which we
first sample the rows of a block and then construct its approximate, low-rank
interpolative decomposition. We examine the feasibility of this approach
theoretically and experimentally. We provide a new theoretical result showing a
tighter bound on the reconstruction error from uniformly sampling rows than the
existing state-of-the-art. We demonstrate that our sampling approach is
competitive with existing (but prohibitively expensive) methods from the
literature. We also construct kernel matrices for the Laplacian, Gaussian, and
polynomial kernels -- all commonly used in physics and data analysis. We
explore the numerical properties of blocks of these matrices, and show that
they are amenable to our approach. Depending on the data set, our randomized
algorithm can successfully compute low-rank approximations in high dimensions.
We report results for data sets with ambient dimensions from four to 1,000. Comment: 43 pages, 21 figures
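To illustrate the row-sampling strategy, the sketch below compresses a far-field block of a Gaussian kernel matrix: it samples rows uniformly, runs a pivoted QR on the sampled block to pick skeleton columns, and fits an interpolation matrix by least squares. The kernel choice, sample sizes, and function names are assumptions for the example, not the paper's implementation or its error bound.

```python
import numpy as np
from scipy.linalg import qr, lstsq

def gaussian_kernel(X, Y, h=1.0):
    """K[i, j] = exp(-||x_i - y_j||^2 / (2 h^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h ** 2))

def sampled_id(X_targets, Y_sources, rank, n_samples, h=1.0, seed=0):
    """Approximate the block K(X_targets, Y_sources) by a column interpolative
    decomposition K ~ K[:, cols] @ T, with skeleton columns chosen from a
    uniformly sampled subset of rows (a sketch of the randomized far-field
    compression idea, not the paper's exact algorithm)."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(len(X_targets), size=n_samples, replace=False)
    K_sub = gaussian_kernel(X_targets[rows], Y_sources, h)   # sampled rows only

    # Pivoted QR on the sampled rows selects `rank` skeleton source columns.
    _, _, piv = qr(K_sub, mode='economic', pivoting=True)
    cols = piv[:rank]

    # Interpolation matrix T maps skeleton columns to all columns
    # (least-squares fit on the sampled rows only).
    T, *_ = lstsq(K_sub[:, cols], K_sub)
    return cols, T

# Usage: a well-separated (far-field) block in 20 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))            # targets
Y = rng.normal(size=(400, 20)) + 8.0      # sources, far from the targets
cols, T = sampled_id(X, Y, rank=20, n_samples=60, h=4.0)
K = gaussian_kernel(X, Y, h=4.0)
err = np.linalg.norm(K - gaussian_kernel(X, Y[cols], h=4.0) @ T) / np.linalg.norm(K)
print(f"relative error of sampled ID: {err:.2e}")
```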
$k$-MLE: A fast algorithm for learning statistical mixture models
We describe $k$-MLE, a fast and efficient local search algorithm for learning
finite statistical mixtures of exponential families such as Gaussian mixture
models. Mixture models are traditionally learned using the
expectation-maximization (EM) soft clustering technique that monotonically
increases the incomplete (expected complete) likelihood. Given prescribed
mixture weights, the hard clustering $k$-MLE algorithm iteratively assigns data
to the most likely weighted component and updates the component models using
Maximum Likelihood Estimators (MLEs). Using the duality between exponential
families and Bregman divergences, we prove that the local convergence of the
complete likelihood of $k$-MLE follows directly from the convergence of a dual
additively weighted Bregman hard clustering. The inner loop of $k$-MLE can be
implemented using any $k$-means heuristic, such as Lloyd's celebrated batched
update or Hartigan's greedy swap update. We then show how to update the mixture
weights by minimizing a cross-entropy criterion, which amounts to setting each
weight to the relative proportion of points in its cluster, and we reiterate the
mixture parameter and mixture weight updates until convergence. Hard EM
is interpreted as a special case of $k$-MLE in which both the component update and
the weight update are performed successively in the inner loop. To initialize
$k$-MLE, we propose $k$-MLE++, a careful initialization of $k$-MLE that probabilistically
guarantees a global bound on the best possible complete likelihood. Comment: 31 pages. Extends a preliminary paper presented at IEEE ICASSP 201
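To make the hard-assignment loop concrete, here is a minimal sketch for a Gaussian mixture with isotropic components: assign each point to its most likely weighted component, refit each component by maximum likelihood, and set the weights to the cluster proportions. The function name, the isotropic-Gaussian simplification, and the random initialization are illustrative assumptions, not the paper's general exponential-family formulation or its $k$-MLE++ seeding.

```python
import numpy as np

def k_mle_spherical(X, k, n_iters=50, seed=0):
    """Hard-assignment mixture learning in the spirit of k-MLE, simplified to
    isotropic Gaussians (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]        # component means
    var = np.full(k, X.var(axis=0).mean())              # isotropic variances
    w = np.full(k, 1.0 / k)                             # mixture weights

    for _ in range(n_iters):
        # Hard assignment: most likely *weighted* component for each point.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        loglik = np.log(w) - 0.5 * d * np.log(2 * np.pi * var) - d2 / (2 * var)
        z = loglik.argmax(axis=1)

        # MLE refit of each component on its assigned points, then weight
        # update as the relative proportion of cluster points.
        for j in range(k):
            pts = X[z == j]
            if len(pts) > 0:
                mu[j] = pts.mean(axis=0)
                var[j] = max(pts.var(axis=0).mean(), 1e-8)
            w[j] = max(len(pts), 1) / n
        w = w / w.sum()
    return mu, var, w, z

# Toy usage on three well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(200, 2)) for c in ([0, 0], [3, 0], [0, 3])])
mu, var, w, z = k_mle_spherical(X, k=3)
print(np.round(mu, 2), np.round(w, 2))
```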
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Optimal recovery by the algorithm is proven analytically
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed. Comment: 13 figures, 35 references
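As a rough illustration of the two-stage idea, the sketch below embeds graph nodes with the low eigenvectors of a normalized Laplacian and then fits a finite Gaussian mixture in that eigenspace to obtain soft memberships. The normalization, the number of eigenvectors, and the use of scikit-learn's GaussianMixture are assumptions made for the example, not the paper's exact model.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.mixture import GaussianMixture

def laplacian_mixture_memberships(A, n_components, n_eigvecs=None, seed=0):
    """Soft (fuzzy) node memberships from a Laplacian eigenspace embedding
    followed by finite mixture modeling (illustrative sketch only)."""
    A = np.asarray(A, dtype=float)
    k = n_eigvecs or n_components
    deg = A.sum(axis=1)
    # Symmetric normalized Laplacian  L = I - D^{-1/2} A D^{-1/2}.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

    # Embed nodes with the eigenvectors of the smallest eigenvalues.
    vals, vecs = eigh(L)
    embedding = vecs[:, :k]

    # Probabilistic memberships from a Gaussian mixture fit in the eigenspace.
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(embedding)
    return gmm.predict_proba(embedding)

# Usage: two cliques connected by a single bridge edge.
B = np.ones((5, 5)) - np.eye(5)
A = np.block([[B, np.zeros((5, 5))], [np.zeros((5, 5)), B]])
A[4, 5] = A[5, 4] = 1.0
memberships = laplacian_mixture_memberships(A, n_components=2)
print(np.round(memberships, 2))
```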
Spatio-Temporal Cluster Detection and Local Moran Statistics of Point Processes
Moran's index is a statistic that measures spatial dependence, quantifying the degree of dispersion or clustering of point processes and events in some location/area. Recognizing that a single Moran's index may not give a sufficient summary of the spatial autocorrelation measure, a local indicator of spatial association (LISA) has gained popularity. Accordingly, we propose extending LISAs to time after partitioning the area and computing a Moran-type statistic for each subarea. Patterns between the local neighbors are unveiled that would not otherwise be apparent. We consider the measures of Moran statistics while incorporating a time factor under a simulated multilevel Palm distribution, a generalized Poisson phenomenon where the clusters and dependence among the subareas are captured by the rate of increase of the process over time. Event propagation is built under spatial nested sequences over time. The Palm parameters, Moran statistics and convergence criteria are calculated from an explicit algorithm in a Markov chain Monte Carlo simulation setting and further analyzed in two real datasets.
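For concreteness, a minimal sketch of a time-sliced local Moran statistic is given below: event counts per subarea are standardized within each time window, and the usual LISA form I_i = z_i * sum_j w_ij z_j is evaluated against a row-standardized neighborhood matrix. The weight matrix, the count aggregation, and the per-window standardization are illustrative choices; the multilevel Palm simulation and MCMC machinery of the paper are not reproduced here.

```python
import numpy as np

def local_morans_i(x, W):
    """Local Moran's I for one time slice: I_i = z_i * sum_j w_ij z_j, with
    z the standardized values and W a spatial weight matrix that is
    row-standardized here (illustrative implementation of the usual LISA)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=0)
    W = W / W.sum(axis=1, keepdims=True)          # row-standardize the weights
    return z * (W @ z)

def local_moran_over_time(counts, W):
    """Apply the local Moran statistic to each time slice of a
    (n_times, n_subareas) array of event counts."""
    return np.vstack([local_morans_i(row, W) for row in counts])

# Usage: 4 subareas on a line (neighbors share an edge), 3 time windows.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
counts = np.array([[10, 12,  3,  2],     # early window: events cluster on the left
                   [ 8, 11,  9,  7],
                   [ 2,  3, 14, 15]])    # late window: events cluster on the right
print(np.round(local_moran_over_time(counts, W), 2))
```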
Optical signatures of quantum delocalization over extended domains in photosynthetic membranes
The prospect of coherent dynamics and excitonic delocalization across several
light-harvesting structures in photosynthetic membranes is of considerable
interest, but challenging to explore experimentally. Here we demonstrate
theoretically that the excitonic delocalization across extended domains
involving several light-harvesting complexes can lead to unambiguous signatures
in the optical response, specifically, linear absorption spectra. We
characterize, under experimentally established conditions of molecular assembly
and protein-induced inhomogeneities, the optical absorption in these arrays
from polarized and unpolarized excitation, and demonstrate that it can be used
as a diagnostic tool to determine the coherent coupling among iso-energetic
light-harvesting structures. The knowledge of these couplings would then
provide further insight into the dynamical properties of transfer, such as
facilitating the accurate determination of F\"orster rates. Comment: 4 figures and Supplementary information with 7 figures. To appear in
Journal of Physical Chemistry A, 201
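As background for how delocalization shows up in linear absorption, the sketch below diagonalizes a toy Frenkel-exciton Hamiltonian and assigns each exciton state the oscillator strength |sum_n c_n mu_n|^2 at its eigenenergy. The site energies, coupling, and dipole orientations are made-up numbers for a dimer, not parameters of any real light-harvesting complex or of the membrane model studied in the paper.

```python
import numpy as np

def stick_absorption_spectrum(site_energies, couplings, dipoles):
    """Stick linear absorption spectrum of a Frenkel-exciton Hamiltonian:
    H[n, n] = site energy, H[n, m] = coupling; the intensity of exciton k is
    |sum_n c_{kn} mu_n|^2 at eigenenergy E_k (illustrative toy model)."""
    H = np.diag(site_energies) + couplings
    energies, C = np.linalg.eigh(H)                 # exciton states (columns of C)
    exciton_dipoles = C.T @ dipoles                 # transition dipole per exciton
    intensities = (exciton_dipoles ** 2).sum(axis=1)
    return energies, intensities

# Toy dimer of two iso-energetic sites with coupling J (all numbers made up).
site_energies = np.array([12500.0, 12500.0])        # cm^-1
J = 100.0                                           # cm^-1, inter-site coupling
couplings = np.array([[0.0, J], [J, 0.0]])
dipoles = np.array([[1.0, 0.0, 0.0],                # parallel site dipoles give
                    [1.0, 0.0, 0.0]])               # one bright and one dark exciton
E, I = stick_absorption_spectrum(site_energies, couplings, dipoles)
print(np.round(E, 1), np.round(I, 2))               # splitting of 2J, bright upper state
```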
Exact algorithms for $L^1$-TV regularization of real-valued or circle-valued signals
We consider $L^1$-TV regularization of univariate signals with values on the
real line or on the unit circle. While the real data space leads to a convex
optimization problem, the problem is non-convex for circle-valued data. In this
paper, we derive exact algorithms for both data spaces. A key ingredient is the
reduction of the infinite search spaces to a finite set of configurations,
which can be scanned by the Viterbi algorithm. To reduce the computational
complexity of the involved tabulations, we extend the technique of distance
transforms to non-uniform grids and to the circular data space. In total, the
proposed algorithms have complexity $\mathcal{O}(KN)$, where $N$ is the length
of the signal and $K$ is the number of different values in the data set. In
particular, the complexity is $\mathcal{O}(N)$ for quantized data. It is the
first exact algorithm for TV regularization with circle-valued data, and it is
competitive with the state-of-the-art methods for scalar data, assuming that
the latter are quantized.
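To illustrate the finite-configuration reduction and the Viterbi scan, here is a naive dynamic-programming sketch for the real-valued case with an L1 data term, using the distinct data values as candidate levels. It tabulates all pairwise level transitions at every stage; the distance-transform technique mentioned in the abstract is precisely what removes that inner loop, and it is not implemented here. Function and variable names are illustrative.

```python
import numpy as np

def l1_tv_exact(y, lam):
    """Minimize  sum_i |x_i - y_i| + lam * sum_i |x_{i+1} - x_i|  with x_i
    restricted to the distinct data values (for the L1 data term this finite
    set is known to contain a global minimizer).  Naive Viterbi tabulation
    with O(K^2) work per stage; a distance transform would reduce this."""
    y = np.asarray(y, dtype=float)
    V = np.unique(y)                                  # finite candidate levels
    N, K = len(y), len(V)

    B = np.abs(y[0] - V)                              # best cost ending at level k
    back = np.zeros((N, K), dtype=int)
    for i in range(1, N):
        # trans[k, j] = cost of coming from level j into level k at stage i.
        trans = B[None, :] + lam * np.abs(V[:, None] - V[None, :])
        back[i] = trans.argmin(axis=1)
        B = np.abs(y[i] - V) + trans.min(axis=1)

    # Backtrack the optimal level sequence.
    x = np.empty(N)
    k = int(B.argmin())
    for i in range(N - 1, -1, -1):
        x[i] = V[k]
        k = back[i, k]
    return x

# Usage: denoise a noisy piecewise-constant signal.
rng = np.random.default_rng(0)
truth = np.repeat([0.0, 2.0, 1.0], 30)
y = truth + 0.3 * rng.standard_normal(truth.size)
x = l1_tv_exact(y, lam=1.5)
print(np.abs(x - truth).mean())
```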
Inverse Density as an Inverse Problem: The Fredholm Equation Approach
In this paper we address the problem of estimating the ratio $\frac{q}{p}$,
where $p$ is a density function and $q$ is another density or, more generally,
an arbitrary function. Knowing or approximating this ratio is needed in various
problems of inference and integration, in particular, when one needs to average
a function with respect to one probability distribution, given a sample from
another. It is often referred to as {\it importance sampling} in statistical
inference and is also closely related to the problem of {\it covariate shift}
in transfer learning as well as to various MCMC methods. It may also be useful
for separating the underlying geometry of a space, say a manifold, from the
density function defined on it.
Our approach is based on reformulating the problem of estimating $\frac{q}{p}$
as an inverse problem in terms of an integral operator
corresponding to a kernel, and thus reducing it to an integral equation, known
as the Fredholm problem of the first kind. This formulation, combined with the
techniques of regularization and kernel methods, leads to a principled
kernel-based framework for constructing algorithms and for analyzing them
theoretically.
The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized
Estimator) is flexible, simple and easy to implement.
We provide detailed theoretical analysis including concentration bounds and
convergence rates for the Gaussian kernel in the case of densities defined on
$\mathbb{R}^d$, compact domains in $\mathbb{R}^d$, and smooth $d$-dimensional sub-manifolds of
the Euclidean space.
We also show experimental results including applications to classification
and semi-supervised learning within the covariate shift framework and
demonstrate some encouraging experimental comparisons. We also show how the
parameters of our algorithms can be chosen in a completely unsupervised manner. Comment: Fixing a few typos in the last version
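A minimal sketch of the Fredholm-equation idea is given below, assuming a Gaussian kernel: the integral of k(x, y) f(y) p(y) is replaced by an empirical average over the p-sample, the right-hand side by an average over the q-sample, and the resulting linear system is solved with plain Tikhonov regularization at the sample points. This pointwise discretization is a simplification for illustration, not the RKHS-regularized FIRE estimator of the paper, and the bandwidth and regularization values are arbitrary.

```python
import numpy as np

def gaussian_kernel(X, Y, h):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h ** 2))

def density_ratio_fredholm(Xp, Xq, h=0.5, lam=1e-2):
    """Estimate f = q/p at the p-sample points by discretizing the Fredholm
    equation  int k(x,y) f(y) p(y) dy = int k(x,z) q(z) dz  with empirical
    averages and Tikhonov regularization (simplified sketch in the spirit of
    FIRE, not the paper's RKHS estimator)."""
    n, m = len(Xp), len(Xq)
    K_pp = gaussian_kernel(Xp, Xp, h)          # kernel among p-samples
    K_pq = gaussian_kernel(Xp, Xq, h)          # kernel between p- and q-samples
    lhs = K_pp / n + lam * np.eye(n)
    rhs = K_pq.sum(axis=1) / m
    f = np.linalg.solve(lhs, rhs)
    return np.maximum(f, 0.0)                  # a density ratio is non-negative

# Usage: covariate shift between two 1-D Gaussians; the true ratio is q(x)/p(x).
rng = np.random.default_rng(0)
Xp = rng.normal(0.0, 1.0, size=(400, 1))       # sample from p
Xq = rng.normal(0.5, 1.0, size=(400, 1))       # sample from q
f_hat = density_ratio_fredholm(Xp, Xq)
true = np.exp(0.5 * (Xp[:, 0] - 0.25))         # exact q/p for N(0.5,1) vs N(0,1)
print(np.corrcoef(f_hat, true)[0, 1])
```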