Innovation Pursuit: A New Approach to Subspace Clustering
In subspace clustering, data points belonging to a union of subspaces are
assigned membership to their respective subspaces. This paper
presents a new approach dubbed Innovation Pursuit (iPursuit) to the problem of
subspace clustering using a new geometrical idea whereby subspaces are
identified based on their relative novelties. We present two frameworks in
which the idea of innovation pursuit is used to distinguish the subspaces.
Underlying the first framework is an iterative method that finds the subspaces
consecutively by solving a series of simple linear optimization problems, each
searching for a direction of innovation in the span of the data that is
potentially orthogonal to all subspaces except the one to be identified in
that step of the algorithm. A detailed mathematical analysis is provided,
establishing
sufficient conditions for iPursuit to correctly cluster the data. The proposed
approach can provably yield exact clustering even when the subspaces have
significant intersections. It is shown that the complexity of the iterative
approach scales only linearly in the number of data points and subspaces, and
quadratically in the dimension of the subspaces. The second framework
integrates iPursuit with spectral clustering to yield a new variant of
spectral-clustering-based algorithms. Numerical simulations with both real
and synthetic data demonstrate that iPursuit can often outperform
state-of-the-art subspace clustering algorithms, especially for subspaces
with significant intersections, and that it significantly improves the
state-of-the-art result for subspace-segmentation-based face clustering.
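As a rough illustration of the innovation-search step, the following is a
minimal sketch, assuming the convex program min_c ||D^T c||_1 subject to
c^T q = 1 with c restricted to the span of the data D (a common formulation
in this line of work); it uses the generic solver cvxpy rather than the
paper's tailored method, and all names are illustrative.

    import numpy as np
    import cvxpy as cp

    def innovation_direction(D, q):
        """Direction in span(D) with minimum l1 projection on the data and
        a non-vanishing projection on the reference point q."""
        U, s, _ = np.linalg.svd(D, full_matrices=False)
        B = U[:, s > 1e-10]                  # orthonormal basis of span(D)
        a = cp.Variable(B.shape[1])
        prob = cp.Problem(cp.Minimize(cp.norm1(D.T @ (B @ a))),
                          [(q @ B) @ a == 1])
        prob.solve()
        return B @ a.value

    # Toy data: two 2-dimensional subspaces in R^5, 30 points each.
    rng = np.random.default_rng(0)
    D = np.hstack([rng.standard_normal((5, 2)) @ rng.standard_normal((2, 30))
                   for _ in range(2)])
    D /= np.linalg.norm(D, axis=0)
    c = innovation_direction(D, D[:, -1])    # reference from the 2nd subspace
    scores = np.abs(D.T @ c)                 # near zero on the other subspace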
Data Dropout in Arbitrary Basis for Deep Network Regularization
An important problem in training deep networks with high capacity is to
ensure that the trained network works well when presented with new inputs
outside the training dataset. Dropout is an effective regularization technique
for boosting network generalization, in which a random subset of the elements
of the given data and the extracted features is set to zero during training.
In this paper, we propose a new randomized regularization technique that
withholds a random part of the data without necessarily turning off individual
neurons/data-elements. In the proposed method, of which the
conventional dropout is shown to be a special case, random data dropout is
performed in an arbitrary basis, hence the designation Generalized Dropout. We
also present a framework whereby the proposed technique can be applied
efficiently to convolutional neural networks. The presented numerical
experiments demonstrate that the proposed technique yields notable performance
gain. Generalized Dropout provides new insight into the idea of dropout, shows
that different basis matrices yield different performance gains, and opens up
a new research question as to how to choose basis matrices that achieve
maximal performance gain.
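A minimal sketch of the idea, under the natural reading of the abstract: the
data is expressed in an arbitrary orthonormal basis B, a random subset of the
coefficients is zeroed, and the result is mapped back. The inverted-dropout
rescaling by the keep probability is my assumption, not stated above; with B
equal to the identity, this reduces to conventional dropout.

    import numpy as np

    def generalized_dropout(x, B, keep_prob=0.8, rng=None):
        """Drop random coefficients of x in the orthonormal basis B."""
        if rng is None:
            rng = np.random.default_rng()
        coeffs = B.T @ x                             # analysis step
        mask = rng.random(coeffs.shape) < keep_prob  # random keep-mask
        return B @ (coeffs * mask) / keep_prob       # rescale (assumed)

    # Example: drop coefficients in a random orthonormal basis.
    n = 64
    B = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))[0]
    x = np.random.default_rng(2).standard_normal(n)
    x_dropped = generalized_dropout(x, B)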
High Dimensional Low Rank plus Sparse Matrix Decomposition
This paper is concerned with the problem of low rank plus sparse matrix
decomposition for big data. Conventional algorithms for matrix decomposition
use the entire data to extract the low-rank and sparse components, and are
based on optimization problems with complexity that scales with the dimension
of the data, which limits their scalability. Furthermore, existing randomized
approaches mostly rely on uniform random sampling, which is quite inefficient
for many real-world data matrices that exhibit additional structure (e.g.,
clustering). In this paper, a scalable subspace-pursuit approach that
transforms the decomposition problem to a subspace learning problem is
proposed. The decomposition is carried out using a small data sketch formed
from sampled columns/rows. Even when the data is sampled uniformly at random,
it is shown that roughly O(r\mu) sampled columns/rows suffice, where \mu is
the coherency parameter and r the rank of the low-rank component. In
addition, adaptive sampling algorithms are proposed to address
the problem of column/row sampling from structured data. We provide an analysis
of the proposed method with adaptive sampling and show that adaptive sampling
makes the required number of sampled columns/rows invariant to the distribution
of the data. The proposed approach is amenable to online implementation, and
an online scheme is presented.
Comment: IEEE Transactions on Signal Processing
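The adaptive sampling component might look roughly like the sketch below,
which draws columns with probability proportional to their residual energy
outside the span of the columns selected so far; this is a standard
adaptive-sampling heuristic, and the paper's exact procedure may differ.

    import numpy as np

    def adaptive_column_sample(M, n_cols, rng=None):
        """Draw columns with probability proportional to their residual
        energy outside the span of the columns selected so far."""
        if rng is None:
            rng = np.random.default_rng()
        idx, Q = [], np.zeros((M.shape[0], 0))
        for _ in range(n_cols):
            resid = M - Q @ (Q.T @ M)          # residual off current span
            p = np.linalg.norm(resid, axis=0) ** 2 + 1e-12
            j = rng.choice(M.shape[1], p=p / p.sum())
            idx.append(j)
            Q = np.linalg.qr(M[:, idx])[0]     # refresh orthonormal basis
        return idx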
Spatial Random Sampling: A Structure-Preserving Data Sketching Tool
Random column sampling is not guaranteed to yield data sketches that preserve
the underlying structures of the data and may not sample sufficiently from
less-populated data clusters. Also, adaptive sampling can often provide
accurate low rank approximations, yet may fall short of producing descriptive
data sketches, especially when the cluster centers are linearly dependent.
Motivated by that, this paper introduces a novel randomized column sampling
tool dubbed Spatial Random Sampling (SRS), in which data points are sampled
based on their proximity to randomly sampled points on the unit sphere. The
most compelling feature of SRS is that the probability of sampling from a
given data cluster is proportional to the surface area the cluster occupies
on the unit sphere, independently of the size of the cluster population.
Although it is fully randomized, SRS is shown to provide descriptive and
balanced data representations. The proposed idea addresses a pressing need in
data science and holds potential to inspire many novel approaches for the
analysis of big data.
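A minimal sketch of SRS under the description above: data points are
projected onto the unit sphere, random directions are drawn, and the data
point closest to each random direction is retained.

    import numpy as np

    def spatial_random_sampling(X, n_dirs, rng=None):
        """Keep the data point closest to each random direction on the
        unit sphere; returns the (unique) sampled column indices."""
        if rng is None:
            rng = np.random.default_rng()
        Xn = X / np.linalg.norm(X, axis=0)       # points on the unit sphere
        dirs = rng.standard_normal((X.shape[0], n_dirs))
        dirs /= np.linalg.norm(dirs, axis=0)     # random sphere directions
        return np.unique(np.argmax(dirs.T @ Xn, axis=1))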
Stuck in Traffic (SiT) Attacks: A Framework for Identifying Stealthy Attacks that Cause Traffic Congestion
Recent advances in wireless technologies have enabled many new applications
in Intelligent Transportation Systems (ITS) such as collision avoidance,
cooperative driving, congestion avoidance, and traffic optimization. Due to the
vulnerable nature of wireless communication against interference and
intentional jamming, ITS face new challenges to ensure the reliability and the
safety of the overall system. In this paper, we expose a class of stealthy
attacks -- Stuck in Traffic (SiT) attacks -- that aim to cause congestion by
exploiting how drivers make decisions based on smart traffic signs. An attacker
mounting a SiT attack solves a Markov Decision Process (MDP) problem to find
optimal or near-optimal attack policies, under which the attacker interferes
with a well-chosen, state-dependent subset of signals. We
apply Approximate Policy Iteration (API) algorithms to derive potent attack
policies. We evaluate their performance on a number of systems and compare
them to other attack policies, including random, myopic, and DoS policies.
The generated policies, albeit suboptimal, are shown to significantly
outperform these baselines, as they maximize the expected cumulative reward
from the standpoint of the attacker.
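For illustration, the policy-iteration machinery behind the attack search
might look like the following exact policy iteration on a small synthetic
MDP; the paper instead applies approximate policy iteration to a much larger
traffic state space, and the transition/reward models here are placeholders.

    import numpy as np

    def policy_iteration(P, R, gamma=0.95, iters=100):
        """P: (A, S, S) transitions, R: (A, S) rewards; returns a policy."""
        nA, nS, _ = P.shape
        pi = np.zeros(nS, dtype=int)
        for _ in range(iters):
            # Policy evaluation: solve (I - gamma * P_pi) v = r_pi.
            P_pi = P[pi, np.arange(nS)]
            r_pi = R[pi, np.arange(nS)]
            v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
            # Policy improvement: greedy one-step lookahead.
            new_pi = np.argmax(R + gamma * (P @ v), axis=0)
            if np.array_equal(new_pi, pi):
                break
            pi = new_pi
        return pi, v

    # Tiny random MDP standing in for the (placeholder) traffic model.
    rng = np.random.default_rng(3)
    P = rng.random((2, 4, 4)); P /= P.sum(axis=2, keepdims=True)
    pi, v = policy_iteration(P, rng.random((2, 4)))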
Subspace Clustering via Optimal Direction Search
This letter presents a new spectral-clustering-based approach to the subspace
clustering problem. Underpinning the proposed method is a convex program for
optimal direction search, which for each data point d finds an optimal
direction in the span of the data that has minimum projection on the other data
points and non-vanishing projection on d. The obtained directions are
subsequently leveraged to identify a neighborhood set for each data point. An
alternating direction method of multipliers framework is provided to
efficiently solve for the optimal directions. The proposed method is shown to
notably outperform existing subspace clustering methods, particularly in
unwieldy scenarios involving high levels of noise and close subspaces, and
yields state-of-the-art results for the problem of face clustering using
subspace segmentation.
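A hedged ADMM sketch for the per-point program suggested by the abstract,
min_c ||D^T c||_1 subject to d^T c = 1: the splitting below (z = D^T c with
soft-thresholding) is one straightforward choice and may differ from the
letter's exact updates; it assumes D has full row rank so the KKT system is
invertible.

    import numpy as np

    def optimal_direction(D, d, rho=1.0, iters=300):
        """ADMM for min ||D.T @ c||_1 s.t. d @ c = 1 (split z = D.T @ c)."""
        m, n = D.shape
        z, u = np.zeros(n), np.zeros(n)
        # KKT matrix of the equality-constrained least-squares c-update;
        # assumes D @ D.T is invertible (full row rank).
        K = np.block([[rho * (D @ D.T), d[:, None]],
                      [d[None, :], np.zeros((1, 1))]])
        for _ in range(iters):
            rhs = np.concatenate([rho * (D @ (z - u)), [1.0]])
            c = np.linalg.solve(K, rhs)[:m]
            v = D.T @ c + u
            z = np.sign(v) * np.maximum(np.abs(v) - 1.0 / rho, 0.0)
            u += D.T @ c - z                  # scaled dual update
        return c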
Randomized Robust Subspace Recovery for High Dimensional Data Matrices
This paper explores and analyzes two randomized designs for robust Principal
Component Analysis (PCA) employing low-dimensional data sketching. In one
design, a data sketch is constructed using random column sampling followed by
low dimensional embedding, while in the other, sketching is based on random
column and row sampling. Both designs are shown to bring about substantial
savings in complexity and memory requirements for robust subspace learning over
conventional approaches that use the full scale data. A characterization of the
sample and computational complexity of both designs is derived in the context
of two distinct outlier models, namely, sparse and independent outlier models.
The proposed randomized approach can provably recover the correct subspace
with computational and sample complexities that are almost independent of the
size of the data. The results of the mathematical analysis are confirmed
through numerical simulations using both synthetic and real data.
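A minimal sketch of the first design as described: uniform random column
sampling followed by a low-dimensional random embedding. The robust subspace
learning step that would run on the resulting sketch is omitted.

    import numpy as np

    def two_stage_sketch(X, n_cols, embed_dim, rng=None):
        """Random column sampling followed by a random embedding."""
        if rng is None:
            rng = np.random.default_rng()
        cols = rng.choice(X.shape[1], size=n_cols, replace=False)
        Phi = rng.standard_normal((embed_dim, X.shape[0])) / np.sqrt(embed_dim)
        return Phi @ X[:, cols], cols          # small sketch + sampled ids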
Boolean Compressed Sensing and Noisy Group Testing
The fundamental task of group testing is to recover a small distinguished
subset of items from a large population while efficiently reducing the total
number of tests (measurements). The key contribution of this paper is in
adopting a new information-theoretic perspective on group testing problems. We
formulate the group testing problem as a channel coding/decoding problem and
derive a single-letter characterization for the total number of tests used to
identify the defective set. Although the focus of this paper is primarily on
group testing, our main result is generally applicable to other compressive
sensing models.
The single-letter characterization is shown to be order-wise tight for many
interesting noisy group testing scenarios. Specifically, we consider an
additive Bernoulli(q) noise model where we show that, for N items and K
defectives, the number of tests T is O(K \log N / (1-q)) for arbitrarily
small average error probability and O(K^2 \log N / (1-q)) for a worst-case
error criterion. We also consider dilution effects whereby a defective item
in a positive pool might get diluted with probability u and potentially
missed. In this case, it is shown that T is O(K \log N / (1-u)^2) and
O(K^2 \log N / (1-u)^2) for the average and the worst-case error
criteria, respectively. Furthermore, our bounds allow us to verify existing
known bounds for noiseless group testing including the deterministic noise-free
case and approximate reconstruction with bounded distortion. Our proof of
achievability is based on random coding and the analysis of a Maximum
Likelihood Detector, and our information theoretic lower bound is based on
Fano's inequality.
Comment: In this revision: reorganized the paper, added citations to related
work, and fixed some bugs.
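A small simulation of the measurement model described above, assuming a
random Bernoulli pooling design and additive Bernoulli(q) noise that flips
negative pools to positive; the maximum-likelihood decoder analyzed in the
paper is omitted, and the parameter names are illustrative.

    import numpy as np

    def noisy_group_tests(n_items, defective, n_tests,
                          p_pool=0.1, q=0.05, rng=None):
        """Random Bernoulli pooling with additive Bernoulli(q) noise."""
        if rng is None:
            rng = np.random.default_rng()
        A = rng.random((n_tests, n_items)) < p_pool   # pooling matrix
        x = np.zeros(n_items, dtype=int)
        x[defective] = 1
        clean = (A.astype(int) @ x) > 0               # OR over pooled items
        noise = rng.random(n_tests) < q               # spurious positives
        return A, clean | noise

    A, y = noisy_group_tests(n_items=500, defective=[3, 42, 99], n_tests=120)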
Controlled Sensing for Multihypothesis Testing
The problem of multiple hypothesis testing with observation control is
considered in both fixed sample size and sequential settings. In the fixed
sample size setting, for binary hypothesis testing, the optimal exponent for
the maximal error probability corresponds to the maximum Chernoff information
over the choice of controls, and a pure stationary open-loop control policy is
asymptotically optimal within the larger class of all causal control policies.
For multihypothesis testing in the fixed sample size setting, lower and upper
bounds on the optimal error exponent are derived. It is also shown through an
example with three hypotheses that the optimal causal control policy can be
strictly better than the optimal open-loop control policy. In the sequential
setting, a test based on earlier work by Chernoff for binary hypothesis
testing is shown to be first-order asymptotically optimal for multihypothesis
testing in a strong sense, using the notion of decision making risk in place of
the overall probability of error. Another test is also designed to meet hard
risk constraints while retaining asymptotic optimality. The role of past
information and randomization in designing optimal control policies is
discussed.
Comment: To appear in the Transactions on Automatic Control
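For the fixed-sample-size binary case, the result suggests selecting the
stationary control that maximizes the Chernoff information between the two
observation distributions it induces; a minimal numerical sketch (with
illustrative names) follows.

    import numpy as np

    def chernoff_information(p0, p1, grid=200):
        """max over 0<s<1 of -log sum_x p0(x)^s p1(x)^(1-s), discrete pmfs."""
        s = np.linspace(1e-3, 1 - 1e-3, grid)[:, None]
        mix = np.sum(p0[None, :] ** s * p1[None, :] ** (1 - s), axis=1)
        return float(np.max(-np.log(mix)))

    def best_control(dists):
        """dists: one (p0, p1) pair of observation pmfs per control."""
        scores = [chernoff_information(p0, p1) for p0, p1 in dists]
        return int(np.argmax(scores)), scores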
- …