Regularized Laplacian Estimation and Fast Eigenvector Approximation
Recently, Mahoney and Orecchia demonstrated that popular diffusion-based
procedures to compute a quick \emph{approximation} to the first nontrivial
eigenvector of a data graph Laplacian \emph{exactly} solve certain regularized
Semi-Definite Programs (SDPs). In this paper, we extend that result by
providing a statistical interpretation of their approximation procedure. Our
interpretation will be analogous to the manner in which $\ell_2$-regularized or
$\ell_1$-regularized $\ell_2$-regression (often called Ridge regression and
Lasso regression, respectively) can be interpreted in terms of a Gaussian prior
or a Laplace prior, respectively, on the coefficient vector of the regression
problem. Our framework will imply that the solutions to the Mahoney-Orecchia
regularized SDP can be interpreted as regularized estimates of the
pseudoinverse of the graph Laplacian. Conversely, it will imply that the
solution to this regularized estimation problem can be computed very quickly by
running, e.g., the fast diffusion-based PageRank procedure for computing an
approximation to the first nontrivial eigenvector of the graph Laplacian.
Empirical results are also provided to illustrate the manner in which
approximate eigenvector computation \emph{implicitly} performs statistical
regularization, relative to running the corresponding exact algorithm.
Comment: 13 pages and 3 figures. A more detailed version of a paper appearing
in the 2011 NIPS Conference.
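
As a concrete illustration of the diffusion procedure mentioned above, the
following is a minimal sketch of personalized PageRank computed by power
iteration, assuming a small undirected graph stored as a dense NumPy adjacency
matrix; the function name, the toy path graph, and the parameter values are
illustrative choices of mine, not taken from the paper.

    import numpy as np

    # Hedged sketch (not the paper's algorithm verbatim): personalized
    # PageRank computed by power iteration on a small undirected graph.
    def personalized_pagerank(A, seed, alpha=0.15, iters=200):
        """Return the PageRank vector with restart distribution `seed`."""
        P = A / A.sum(axis=1)[:, None]   # row-stochastic transition matrix
        s = seed / seed.sum()
        x = s.copy()
        for _ in range(iters):
            x = alpha * s + (1.0 - alpha) * (P.T @ x)
        return x

    # Toy 4-node path graph (illustrative data only).
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    seed = np.array([1.0, 0.0, 0.0, 0.0])
    print(personalized_pagerank(A, seed))

Smaller values of alpha let the diffusion spread further over the graph, while
larger values keep the vector concentrated near the seed; in the
regularized-SDP interpretation, the teleportation parameter alpha plays the
role of the regularization strength.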
Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis
Database theory and database practice are typically the domain of computer
scientists who adopt what may be termed an algorithmic perspective on their
data. This perspective is very different from the more statistical perspective
adopted by statisticians, scientific computing researchers, machine learners,
and others who
work on what may be broadly termed statistical data analysis. In this article,
I will address fundamental aspects of this algorithmic-statistical disconnect,
with an eye to bridging the gap between these two very different approaches. A
concept that lies at the heart of this disconnect is that of statistical
regularization, a notion that has to do with how robust the output of an
algorithm is to the noise properties of the input data. Although it is nearly
completely absent from computer science, which historically has taken the input
data as given and modeled algorithms discretely, regularization in one form or
another is central to nearly every application domain that applies algorithms
to noisy data. By using several case studies, I will illustrate, both
theoretically and empirically, the nonobvious fact that approximate
computation, in and of itself, can implicitly lead to statistical
regularization. This and other recent work suggests that, by exploiting in a
more principled way the statistical properties implicit in worst-case
algorithms, one can in many cases satisfy the bicriteria of having algorithms
that are scalable to very large-scale databases and that also have good
inferential or predictive properties.
Comment: To appear in the Proceedings of the 2012 ACM Symposium on Principles
of Database Systems (PODS 2012).
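
One way to see the claim of implicit regularization in miniature: stopping
power iteration early biases the iterate toward its starting point, much as an
explicit penalty term would. The sketch below, on entirely synthetic data,
tracks how the alignment of the iterate with the noiseless top eigenvector
varies with the iteration budget; the matrix size and noise level are
hypothetical choices of mine, not case studies from the article.

    import numpy as np

    # Illustrative demo of implicit regularization via approximate
    # computation: truncated power iteration on a noisy rank-1 matrix.
    rng = np.random.default_rng(0)
    n = 200
    u = np.ones(n) / np.sqrt(n)          # noiseless top eigenvector
    noise = rng.standard_normal((n, n))
    M = 2.0 * np.outer(u, u) + 0.01 * (noise + noise.T)

    x0 = rng.standard_normal(n)
    x0 /= np.linalg.norm(x0)
    for iters in (1, 5, 20, 100):
        x = x0.copy()
        for _ in range(iters):           # stop early = stay regularized
            x = M @ x
            x /= np.linalg.norm(x)
        print(iters, abs(x @ u))         # alignment with noiseless direction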
Semi-supervised Eigenvectors for Large-scale Locally-biased Learning
In many applications, one has side information, e.g., labels that are
provided in a semi-supervised manner, about a specific target region of a large
data set, and one wants to perform machine learning and data analysis tasks
"nearby" that prespecified target region. For example, one might be interested
in the clustering structure of a data graph near a prespecified "seed set" of
nodes, or one might be interested in finding partitions in an image that are
near a prespecified "ground truth" set of pixels. Locally-biased problems of
this sort are particularly challenging for popular eigenvector-based machine
learning and data analysis tools. At root, the reason is that eigenvectors are
inherently global quantities, thus limiting the applicability of
eigenvector-based methods in situations where one is interested in very local
properties of the data.
In this paper, we address this issue by providing a methodology to construct
semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these
locally-biased eigenvectors can be used to perform locally-biased machine
learning. These semi-supervised eigenvectors capture
successively-orthogonalized directions of maximum variance, conditioned on
being well-correlated with an input seed set of nodes that is assumed to be
provided in a semi-supervised manner. We show that these semi-supervised
eigenvectors can be computed quickly as the solution to a system of linear
equations; and we also describe several variants of our basic method that have
improved scaling properties. We provide several empirical examples
demonstrating how these semi-supervised eigenvectors can be used to perform
locally-biased learning; and we discuss the relationship between our results
and recent machine learning algorithms that use global eigenvectors of the
graph Laplacian.
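
The linear-system characterization above can be sketched concretely. Under my
reading of the construction, the leading semi-supervised eigenvector is
proportional to the solution of a system of the form (L - gamma D) x = D s,
where L is the combinatorial graph Laplacian, D the degree matrix, s a seed
indicator made D-orthogonal to the all-ones vector, and gamma a locality
parameter; the sketch takes gamma negative so that the system matrix is
positive definite. The graph, the seed, and the parameter value are toy
choices, not from the paper.

    import numpy as np

    # Simplified sketch of a locally-biased vector obtained by a single
    # linear-system solve: (L - gamma * D) x = D s.
    def semi_supervised_vector(A, seed_idx, gamma=-0.1):
        d = A.sum(axis=1)
        D = np.diag(d)
        L = D - A                            # combinatorial graph Laplacian
        s = np.zeros(A.shape[0])
        s[seed_idx] = 1.0
        s -= (d @ s) / d.sum()               # make s D-orthogonal to ones
        x = np.linalg.solve(L - gamma * D, D @ s)
        return x / np.sqrt(x @ (D @ x))      # normalize so x^T D x = 1

    # Two triangles joined by one edge; seed in the left triangle.
    A = np.zeros((6, 6))
    for i, j in [(0,1), (1,2), (0,2), (3,4), (4,5), (3,5), (2,3)]:
        A[i, j] = A[j, i] = 1.0
    print(semi_supervised_vector(A, seed_idx=0))

Under this reading, letting gamma approach the second-smallest generalized
eigenvalue from below recovers increasingly global solutions, while driving
gamma toward minus infinity concentrates the solution on the seed set, which
is the locally-biased regime.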