PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs
SimRank is a classic measure of the similarities of nodes in a graph.
Given a node $u$ in graph $G = (V, E)$, a single-source SimRank query
returns the SimRank similarities $s(u, v)$ between node $u$ and each node $v \in V$. This type of query has numerous applications in web search and social
network analysis, such as link prediction, web mining, and spam detection.
Existing methods for single-source SimRank queries, however, incur a query cost
at least linear in the number of nodes $n$, which renders them inapplicable to
real-time and interactive analysis.
This paper proposes PRSim, an algorithm that exploits the structure of
graphs to efficiently answer single-source SimRank queries. PRSim uses an
index of size $O(m)$, where $m$ is the number of edges in the graph, and
guarantees a query time that depends on the reverse PageRank distribution
of the input graph. In particular, we prove that PRSim runs in sub-linear time
if the degree distribution of the input graph follows a power-law
distribution, a property possessed by many real-world graphs. Based on the
theoretical analysis, we show that the empirical query time of all existing
SimRank algorithms also depends on the reverse PageRank distribution of the
graph. Finally, we present the first experimental study that evaluates the
absolute errors of various SimRank algorithms on large graphs, and we show that
PRSim outperforms the state of the art in terms of query time, accuracy, index
size, and scalability.
Comment: ACM SIGMOD 201
Exact Single-Source SimRank Computation on Large Graphs
SimRank is a popular measurement for evaluating the node-to-node similarities
based on the graph topology. In recent years, single-source and top-$k$ SimRank
queries have received increasing attention due to their applications in web
mining, social network analysis, and spam detection. However, a fundamental
obstacle in studying SimRank has been the lack of ground truths. The only exact
algorithm, the Power Method, is computationally infeasible on graphs with more than
$10^6$ nodes. Consequently, no existing work has evaluated the actual
trade-offs between query time and accuracy on large real-world graphs. In this
paper, we present ExactSim, the first algorithm that computes the exact
single-source and top-$k$ SimRank results on large graphs. With high
probability, this algorithm produces ground truths with a rigorous theoretical
guarantee. We conduct extensive experiments on real-world datasets to
demonstrate the efficiency of ExactSim. The results show that ExactSim provides
the ground truth for any single-source SimRank query with a precision up to 7
decimal places within a reasonable query time.
Comment: ACM SIGMOD 202
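To make concrete why the Power Method mentioned above does not scale, the following minimal sketch iterates the standard SimRank recurrence over all node pairs. It is an illustration only, not the ExactSim algorithm; the function name `simrank_power_method`, the decay factor `c = 0.6`, and the toy graph are assumptions made for this example.

```python
def simrank_power_method(in_neighbors, c=0.6, iterations=10):
    """All-pairs SimRank by fixed-point iteration.

    in_neighbors maps every node to the list of its in-neighbors.
    Each iteration touches all n^2 node pairs, which is what makes
    this approach impractical on large graphs.
    """
    nodes = list(in_neighbors)
    # Iteration 0: a node is only similar to itself.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                    continue
                iu, iv = in_neighbors[u], in_neighbors[v]
                if not iu or not iv:
                    # By definition, similarity is 0 if either node
                    # has no in-neighbors.
                    new[u][v] = 0.0
                    continue
                total = sum(sim[a][b] for a in iu for b in iv)
                new[u][v] = c * total / (len(iu) * len(iv))
        sim = new
    return sim


# Toy graph (hypothetical): a -> b and a -> c.
toy_in_neighbors = {"a": [], "b": ["a"], "c": ["a"]}
toy_sim = simrank_power_method(toy_in_neighbors)
# b and c share their only in-neighbor a, so s(b, c) = c * s(a, a) = 0.6.
print(toy_sim["b"]["c"])
```

The nested loops over `nodes` store and update a full n-by-n similarity table, so both time and memory grow quadratically with the number of nodes.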
Personalized PageRank on Evolving Graphs with an Incremental Index-Update Scheme
Personalized PageRank (PPR) stands as a fundamental proximity measure
in graph mining. Since computing an exact single-source PPR (SSPPR) query answer is prohibitive,
most existing solutions turn to approximate queries with guarantees. The
state-of-the-art solutions for approximate SSPPR queries are index-based and
mainly focus on static graphs, while real-world graphs are usually dynamically
changing. However, existing index-update schemes cannot achieve a sub-linear
update time. Motivated by this, we present an efficient indexing scheme to
maintain indexed random walks in expected $O(1)$ time after each graph update.
To reduce the space consumption, we further propose a new sampling scheme to
remove the auxiliary data structure for vertices while still supporting
$O(1)$ index update cost on evolving graphs. Extensive experiments show that our
update scheme achieves orders of magnitude speed-up on update performance over
existing index-based dynamic schemes without sacrificing the query efficiency.
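The random walks that index-based SSPPR methods rely on can be illustrated with a plain Monte Carlo estimator. The sketch below is not the indexing or update scheme of the paper: it simply samples fresh walks per query. The names `sample_walk_endpoint` and `estimate_ppr`, the stopping probability `alpha = 0.2`, and the convention that a walk stops at a node with no out-neighbors are all assumptions made for this example.

```python
import random


def sample_walk_endpoint(out_neighbors, source, alpha, rng):
    """Run one random walk from source that stops with probability alpha
    at every step, and return the node where it stops."""
    node = source
    while True:
        if rng.random() < alpha:
            return node
        neighbors = out_neighbors.get(node, [])
        if not neighbors:
            # Assumed convention: a walk stuck at a dangling node ends there.
            return node
        node = rng.choice(neighbors)


def estimate_ppr(out_neighbors, source, alpha=0.2, num_walks=10000, seed=0):
    """Estimate the PPR of every node with respect to source as the
    fraction of sampled walks that stop at that node."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(num_walks):
        end = sample_walk_endpoint(out_neighbors, source, alpha, rng)
        counts[end] = counts.get(end, 0) + 1
    return {node: count / num_walks for node, count in counts.items()}


# Toy graph (hypothetical): a two-node cycle a <-> b.
cycle = {"a": ["b"], "b": ["a"]}
ppr_from_a = estimate_ppr(cycle, "a")
```

On the two-node cycle, the exact value for the source itself is alpha / (1 - (1 - alpha)^2), about 0.556 for alpha = 0.2, which the estimate approaches as `num_walks` grows. An index-based method stores such walks ahead of time, which is why each edge insertion or deletion forces some stored walks to be repaired.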
COMPUTING APPROXIMATE CUSTOMIZED RANKING
As the amount of information grows and as users become more
sophisticated, ranking techniques become important building blocks
to meet user needs when answering queries. PageRank is one of the
most successful link-based ranking methods, which iteratively
computes the importance scores for web pages based on the importance scores of incoming pages. Due to its success, PageRank has been applied in a number of applications that require customization.
We address the scalability challenges for two types of customized
ranking. The first challenge is to compute the ranking of a
subgraph. Various Web applications focus on identifying a
subgraph, such as focused crawlers and localized search engines.
The second challenge is to compute online personalized ranking.
Personalized search improves the quality of search results for each
user. The user needs are represented by a personalized set of pages
or personalized link importance in an entity relationship graph.
This requires an efficient online computation.
To solve the subgraph ranking problem efficiently, we estimate the
ranking scores for a subgraph. We propose a framework of an exact
solution (IdealRank) and an approximate solution (ApproxRank) for
computing ranking on a subgraph. Both IdealRank and ApproxRank
represent the set of external pages with a single external node
and modify the PageRank-style transition matrix with respect to that node. The IdealRank algorithm assumes that the scores of external pages are known. We prove that the IdealRank scores for pages in the subgraph converge to the true PageRank scores. Since the PageRank-style scores of external pages may not typically be available, we propose the ApproxRank algorithm to estimate scores for the subgraph. We analyze the distance between IdealRank scores and ApproxRank scores of the subgraph and show that it is within a
constant factor of the distance of the external pages. We demonstrate with real and synthetic data that ApproxRank provides a good approximation to PageRank for a variety of subgraphs.
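The idea of replacing all external pages by one external node can be sketched as follows. This is a hedged illustration rather than the thesis's exact construction: it collapses the full damped PageRank matrix, treats dangling pages as jumping uniformly, and uses a damping factor of 0.85, all of which are assumptions. The helper names (`stationary`, `google_matrix`, `collapse_external`) and the five-page graph are hypothetical. Weighting external pages by their true scores plays the role of IdealRank, and weighting them equally plays the role of ApproxRank.

```python
def stationary(P, iterations=200):
    """Power iteration for the stationary distribution of a
    row-stochastic matrix P (list of rows)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def google_matrix(out_links, n, damping=0.85):
    """Row-stochastic PageRank matrix for pages 0..n-1.
    Assumption: a page without out-links jumps uniformly to any page."""
    P = []
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:
            row = [(1.0 - damping) / n] * n
            for j in targets:
                row[j] += damping / len(targets)
        else:
            row = [1.0 / n] * n
        P.append(row)
    return P


def collapse_external(P, local, ext_weights):
    """Merge every page outside `local` into one external node, placed
    last. ext_weights[e] is the relative weight of external page e."""
    n = len(P)
    external = [e for e in range(n) if e not in local]
    total = sum(ext_weights[e] for e in external)
    Q = []
    for i in local:
        row = [P[i][j] for j in local]
        row.append(sum(P[i][e] for e in external))  # all mass leaving the subgraph
        Q.append(row)
    # Transitions out of the external node: weighted average over external pages.
    ext_row = [sum(ext_weights[e] * P[e][j] for e in external) / total
               for j in local]
    ext_row.append(1.0 - sum(ext_row))  # mass that stays external
    Q.append(ext_row)
    return Q


# Hypothetical five-page web; pages 0, 1, 2 form the subgraph of interest.
out_links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0], 4: [3, 0]}
local = [0, 1, 2]
P = google_matrix(out_links, 5)
true_pr = stationary(P)

# IdealRank-style: external pages weighted by their known true scores.
ideal = stationary(collapse_external(P, local, true_pr))
# ApproxRank-style: external scores unknown, so all weights are equal.
approx = stationary(collapse_external(P, local, [1.0] * 5))
```

With the true scores as weights, the first three entries of `ideal` coincide with the global PageRank scores of pages 0, 1, and 2, which mirrors the convergence claim for IdealRank. The entries of `approx` differ from them only through the error introduced by the equal-weight assumption on the external pages.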
We consider online personalization using ObjectRank, an
authority-flow-based ranking for entity relationship graphs. We formalize the concept of an aggregate surfer on a data graph; the surfer's behavior is controlled by multiple personalized rankings. We prove a linearity
theorem over these rankings that can be used as a tool to scale
this type of personalization. DataApprox uses a repository of precomputed rankings for a given set of link-weight assignments. We define DataApprox as an optimization problem; it selects a subset of the precomputed rankings from the repository and produces a weighted combination of these rankings. We analyze the distance between the DataApprox scores and the real authority flow ranking scores and show that DataApprox has a theoretical bound. Our experiments on the DBLP data graph show that DataApprox performs well in practice and allows fast and accurate personalized authority flow ranking.