Changepoint Detection over Graphs with the Spectral Scan Statistic
We consider the change-point detection problem of deciding, based on noisy
measurements, whether an unknown signal over a given graph is constant or is
instead piecewise constant over two connected induced subgraphs of relatively
low cut size. We analyze the corresponding generalized likelihood ratio (GLR)
statistic and relate it to the problem of finding a sparsest cut in a graph.
We develop a tractable relaxation of the GLR statistic based on the
combinatorial Laplacian of the graph, which we call the spectral scan
statistic, and analyze its properties. We show how its performance as a testing
procedure depends directly on the spectrum of the graph, and use this result to
explicitly derive its asymptotic properties on a few significant graph
topologies. Finally, we demonstrate both theoretically and by simulations that
the spectral scan statistic can outperform naive testing procedures based on
edge thresholding and testing
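As a small illustration of the objects named in this abstract, the following sketch builds the combinatorial Laplacian L = D - A of a path graph, computes its spectrum, and checks that the quadratic form x^T L x of a vertex-set indicator equals the cut size. It is not the spectral scan statistic itself; the helper names and the 4-node path graph are assumptions chosen for the example.

```python
import numpy as np

def combinatorial_laplacian(adjacency):
    """Return L = D - A for a symmetric adjacency matrix A."""
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1)
    return np.diag(degrees) - adjacency

def path_graph_adjacency(n):
    """Adjacency matrix of the path graph on n vertices (hypothetical helper)."""
    a = np.zeros((n, n))
    idx = np.arange(n - 1)
    a[idx, idx + 1] = 1.0
    a[idx + 1, idx] = 1.0
    return a

A = path_graph_adjacency(4)
L = combinatorial_laplacian(A)

# The Laplacian is symmetric, so eigvalsh returns real eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(L)

# For the indicator x of a vertex set, x^T L x sums (x_i - x_j)^2 over edges,
# which is the number of edges cut by the partition.
x = np.array([1.0, 1.0, 0.0, 0.0])
cut_size = x @ L @ x
```

The smallest eigenvalue is zero for any graph, and the size of the remaining eigenvalues reflects how well connected the graph is, which is the kind of spectral information the abstract refers to.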
Sequential Changepoint Approach for Online Community Detection
We present new algorithms for detecting the emergence of a community in large
networks from sequential observations. The networks are modeled using
Erdős-Rényi random graphs with edges forming between nodes in the community
with higher probability. Based on statistical changepoint detection
methodology, we develop three algorithms: the Exhaustive Search (ES), the
mixture, and the Hierarchical Mixture (H-Mix) methods. Performance of these
methods is evaluated by the average run length (ARL), which captures the
frequency of false alarms, and the detection delay. Numerical comparisons show
that the ES method performs the best; however, it is exponentially complex. The
mixture method is polynomially complex by exploiting the fact that the size of
the community is typically small in a large network. However, it may react to a
group of active edges that do not form a community. This issue is resolved by
the H-Mix method, which is based on a dendrogram decomposition of the network.
We present an asymptotic analytical expression for ARL of the mixture method
when the threshold is large. Numerical simulation verifies that our
approximation is accurate even in the non-asymptotic regime. Hence, it can be
used to determine a desired threshold efficiently. Finally, numerical examples
show that the mixture and the H-Mix methods can both detect a community quickly
with a lower complexity than the ES method. Comment: Submitted to 2014 INFORMS Workshop on Data Mining and Analytics and
an IEEE journa
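The trade-off between false alarms and detection delay can be illustrated with a textbook CUSUM recursion on a single stream of Bernoulli edge indicators. This is a generic sketch, not the ES, mixture, or H-Mix method; the function names, the probabilities p0 and p1, and the threshold are assumptions made for the example.

```python
import math

def bernoulli_llr(x, p0, p1):
    """Log-likelihood ratio of one Bernoulli observation, post-change vs pre-change."""
    return math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))

def cusum_stopping_time(observations, p0, p1, threshold):
    """Return the first time the CUSUM statistic reaches the threshold, or None."""
    w = 0.0
    for t, x in enumerate(observations, start=1):
        # Accumulate evidence for a change, resetting to zero when it turns negative.
        w = max(0.0, w + bernoulli_llr(x, p0, p1))
        if w >= threshold:
            return t
    return None

# Hypothetical edge-activity stream: sparse at first, then persistently active.
stream = [0, 0, 1, 0, 0, 1, 1, 1, 1]
alarm_time = cusum_stopping_time(stream, p0=0.1, p1=0.5, threshold=4.0)
```

Raising the threshold makes false alarms rarer (a longer average run length) at the cost of a longer detection delay, which is why an accurate ARL approximation helps in choosing it.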
Unsupervised robust nonparametric learning of hidden community properties
We consider learning of fundamental properties of communities in large noisy
networks, in the prototypical situation where the nodes or users are split into
two classes according to a binary property, e.g., according to their opinions
or preferences on a topic. For learning these properties, we propose a
nonparametric, unsupervised, and scalable graph scan procedure that is, in
addition, robust against a class of powerful adversaries. In our setup, one of
the communities can fall under the influence of a knowledgeable adversarial
leader, who knows the full network structure, has unlimited computational
resources and can completely foresee our planned actions on the network. We
prove strong consistency of our results in this setup with minimal assumptions.
In particular, the learning procedure estimates the baseline activity of normal
users asymptotically correctly with probability 1; the only assumption being
the existence of a single implicit community of asymptotically negligible
logarithmic size. We provide experiments on real and synthetic data to
illustrate the performance of our method, including examples with adversaries. Comment: Experiments with new types of adversaries adde
Spectral methods and computational trade-offs in high-dimensional statistical inference
Spectral methods have become increasingly popular in designing fast algorithms for modern high-dimensional datasets. This thesis looks at several problems in which spectral methods play a central role. In some cases, we also show that such procedures have essentially the best performance among all randomised polynomial time algorithms by exhibiting statistical and computational trade-offs in those problems. In the first chapter, we prove a useful variant of the well-known Davis-Kahan theorem, which is a spectral perturbation result that allows us to bound the distance between population eigenspaces and their sample versions. We then propose a semi-definite programming algorithm for the sparse principal component analysis (PCA) problem, and analyse its theoretical performance using the perturbation bounds we derived earlier. It turns out that the parameter regime in which our estimator is consistent is strictly smaller than the consistency regime of a minimax optimal (yet computationally intractable) estimator. We show through reduction from a well-known hard problem in computational complexity theory that the difference in consistency regimes is unavoidable for any randomised polynomial time estimator, hence revealing subtle statistical and computational trade-offs in this problem. Such computational trade-offs also exist in the problem of restricted isometry certification. Certifiers for restricted isometry properties can be used to construct design matrices for sparse linear regression problems. As in the sparse PCA problem, we show that there is also an intrinsic gap between the class of matrices certifiable using unrestricted algorithms and using polynomial time algorithms. Finally, we consider the problem of high-dimensional changepoint estimation, where we estimate the time of change in the mean of a high-dimensional time series with piecewise constant mean structure.
Motivated by real world applications, we assume that changes only occur in a sparse subset of all coordinates. We apply a variant of the semi-definite programming algorithm in sparse PCA to aggregate the signals across different coordinates in a near optimal way so as to estimate the changepoint location as accurately as possible. Our statistical procedure shows superior performance compared to existing methods in this problem. St John's College and Cambridge Overseas Trus
Spatial CUSUM for Signal Region Detection
Detecting weak clustered signal in spatial data is important but challenging
in applications such as medical imaging and epidemiology. A more efficient
detection algorithm can provide more precise early warning, and effectively
reduce the decision risk and cost. To date, many methods have been developed to
detect signals with spatial structures. However, most of the existing methods
are either too conservative for weak signals or computationally too intensive.
In this paper, we consider a novel method named Spatial CUSUM (SCUSUM), which
employs the idea of the CUSUM procedure and false discovery rate controlling.
We develop theoretical properties of the method, which indicate that
asymptotically SCUSUM can reach high classification accuracy. In the simulation
study, we demonstrate that SCUSUM is sensitive to weak spatial signals. This
new method is applied to a real fMRI dataset as illustration, and more
irregular weak spatial signals are detected in the images compared to some
existing methods, including the conventional FDR, FDR and scan statistics
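For context on the conventional FDR baseline mentioned here, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It is not SCUSUM; the function name and the example p-values are assumptions made for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest index that passes its threshold.
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected

# Hypothetical p-values, one per spatial location.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]
mask = benjamini_hochberg(p_vals, alpha=0.05)
```

The procedure treats locations independently of their position, which is one reason a method that exploits spatial clustering may pick up weak signals that this baseline misses.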
Bayesian anomaly detection methods for social networks
Learning the network structure of a large graph is computationally demanding,
and dynamically monitoring the network over time for any changes in structure
threatens to be more challenging still. This paper presents a two-stage method
for anomaly detection in dynamic graphs: the first stage uses simple, conjugate
Bayesian models for discrete time counting processes to track the pairwise
links of all nodes in the graph to assess normality of behavior; the second
stage applies standard network inference tools on a greatly reduced subset of
potentially anomalous nodes. The utility of the method is demonstrated on
simulated and real data sets. Comment: Published in at http://dx.doi.org/10.1214/10-AOAS329 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
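The first stage described above can be pictured with a simple conjugate model for one pair of nodes. The sketch below assumes a Beta-Bernoulli model for whether the pair communicates in each discrete time step and scores each observation by its posterior predictive probability. The class name, the uniform prior, and the activity history are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class BetaBernoulliLink:
    """Conjugate Beta(alpha, beta) model for the activity of one node pair."""
    alpha: float = 1.0
    beta: float = 1.0

    def predictive_probability(self, active):
        """Posterior predictive probability of the observed outcome."""
        p_active = self.alpha / (self.alpha + self.beta)
        return p_active if active else 1.0 - p_active

    def update(self, active):
        """Conjugate update: add one pseudo-count to the matching parameter."""
        if active:
            self.alpha += 1.0
        else:
            self.beta += 1.0

link = BetaBernoulliLink()
history = [False] * 18 + [True]  # hypothetical: a long quiet period, then activity
scores = []
for active in history:
    # Score each observation before it is absorbed into the posterior.
    scores.append(link.predictive_probability(active))
    link.update(active)
```

A low predictive probability marks an observation as unusual for that pair, and pairs with persistently low scores would be candidates for the more expensive second-stage analysis.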
Optimal change point detection and localization in sparse dynamic networks
We study the problem of change point localization in dynamic network models. We assume that we observe a sequence of independent adjacency matrices of the same size, each corresponding to a realization of an unknown inhomogeneous Bernoulli model. The underlying distributions of the adjacency matrices are piecewise constant, and may change over a subset of the time points, called change points. We are concerned with recovering the unknown number and positions of the change points. In our model setting, we allow for all the model parameters to change with the total number of time points, including the network size, the minimal spacing between consecutive change points, the magnitude of the smallest change and the degree of sparsity of the networks. We first identify a region of impossibility in the space of the model parameters such that no change point estimator is provably consistent if the data are generated according to parameters falling in that region. We propose a computationally simple algorithm for network change point localization, called network binary segmentation, that relies on weighted averages of the adjacency matrices. We show that network binary segmentation is consistent over a range of the model parameters that nearly covers the complement of the impossibility region, thus demonstrating the existence of a phase transition for the problem at hand. Next, we devise a more sophisticated algorithm based on singular value thresholding, called local refinement, that delivers more accurate estimates of the change point locations. Under appropriate conditions, local refinement guarantees a minimax optimal rate for network change point localization while remaining computationally feasible
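The idea of comparing weighted averages of adjacency matrices before and after a candidate time can be sketched with a generic CUSUM statistic measured in Frobenius norm. This is an illustrative simplification for a single change point, not the network binary segmentation or local refinement algorithm of the paper; the function names, network size, edge probabilities, and seed are assumptions.

```python
import numpy as np

def cusum_frobenius(adjacencies):
    """Frobenius norm of the CUSUM matrix at each candidate split t = 1, ..., T-1."""
    a = np.asarray(adjacencies, dtype=float)  # shape (T, n, n)
    T = a.shape[0]
    prefix = np.cumsum(a, axis=0)
    total = prefix[-1]
    stats = np.empty(T - 1)
    for t in range(1, T):
        left = prefix[t - 1]   # sum of the first t matrices
        right = total - left   # sum of the remaining T - t matrices
        w_left = np.sqrt((T - t) / (T * t))
        w_right = np.sqrt(t / (T * (T - t)))
        stats[t - 1] = np.linalg.norm(w_left * left - w_right * right)
    return stats

def sample_networks(rng, p, n, count):
    """Draw symmetric Bernoulli(p) adjacency matrices with an empty diagonal."""
    upper = np.triu(rng.random((count, n, n)) < p, k=1)
    return (upper | upper.transpose(0, 2, 1)).astype(float)

rng = np.random.default_rng(0)
n, change = 40, 20
before = sample_networks(rng, 0.1, n, change)  # sparse regime
after = sample_networks(rng, 0.5, n, 20)       # denser regime after the change
stats = cusum_frobenius(np.concatenate([before, after]))
estimate = int(np.argmax(stats)) + 1           # number of observations before the split
```

With several change points, a statistic of this kind would be applied recursively to the segments on either side of each detected split, which is the general binary segmentation strategy.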