Fast and Accurate Random Walk with Restart on Dynamic Graphs with Guarantees
Given a time-evolving graph, how can we track similarity between nodes in a
fast and accurate way, with theoretical guarantees on the convergence and the
error? Random Walk with Restart (RWR) is a popular measure to estimate the
similarity between nodes and has been exploited in numerous applications. Many
real-world graphs are dynamic with frequent insertion/deletion of edges; thus,
tracking RWR scores on dynamic graphs in an efficient way has aroused much
interest among data mining researchers. Recently, dynamic RWR models based on
the propagation of scores across a given graph have been proposed, and have
succeeded in outperforming previous approaches to computing RWR
dynamically. However, those models fail to guarantee exactness and convergence
time for updating RWR in a generalized form. In this paper, we propose OSP, a
fast and accurate algorithm for computing dynamic RWR with insertion/deletion
of nodes/edges in a directed/undirected graph. When the graph is updated, OSP
first calculates offset scores around the modified edges, propagates the offset
scores across the updated graph, and then merges them with the current RWR
scores to get updated RWR scores. We prove the exactness of OSP and introduce
OSP-T, a version of OSP which regulates a trade-off between accuracy and
computation time by using an error tolerance $\epsilon$. Given restart probability
$c$, OSP-T guarantees to return RWR scores with $O(\epsilon/c)$ error in
$O(\log(\epsilon/2)/\log(1-c))$ iterations. Through extensive experiments, we
show that OSP tracks RWR exactly up to 4605x faster than an existing static RWR
method on dynamic graphs, and OSP-T requires up to 15x less time with 730x
lower L1 norm error and 3.3x lower rank error than other state-of-the-art
dynamic RWR methods.
Comment: 10 pages, 8 figures
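The offset-then-propagate update described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a small dense adjacency matrix in which every node has at least one out-edge, and the names `propagate`, `rwr`, and `update_rwr` are hypothetical. The sketch relies only on the linear identity that the change in RWR scores after a graph update equals the propagation, over the new graph, of the offset $(1-c)(\tilde{B}^T - \tilde{A}^T) r_{old}$, where $\tilde{A}$ and $\tilde{B}$ are the row-normalized old and new adjacency matrices.

```python
def row_normalize(adj):
    """Row-normalize a dense adjacency matrix given as a list of lists."""
    out = []
    for row in adj:
        s = sum(row)
        if s == 0:
            # Assumption of this sketch: no dangling nodes.
            raise ValueError("every node needs at least one out-edge")
        out.append([v / s for v in row])
    return out


def transpose_matvec(mat, vec):
    """Return mat^T @ vec for a dense square matrix."""
    n = len(mat)
    res = [0.0] * n
    for i in range(n):
        for j in range(n):
            res[j] += mat[i][j] * vec[i]
    return res


def propagate(norm_adj, start, c, tol):
    """Sum ((1 - c) * A^T)^k @ start over k >= 0.

    Stops once the L1 norm of the current term drops below tol.
    """
    total = list(start)
    term = list(start)
    while sum(abs(v) for v in term) >= tol:
        term = [(1 - c) * v for v in transpose_matvec(norm_adj, term)]
        total = [t + x for t, x in zip(total, term)]
    return total


def rwr(adj, seed, c=0.15, tol=1e-12):
    """Static RWR scores for one seed node, computed from scratch."""
    start = [0.0] * len(adj)
    start[seed] = c
    return propagate(row_normalize(adj), start, c, tol)


def update_rwr(old_adj, new_adj, old_scores, c=0.15, tol=1e-12):
    """Update RWR scores after a graph change without restarting."""
    a = row_normalize(old_adj)
    b = row_normalize(new_adj)
    new_push = transpose_matvec(b, old_scores)
    old_push = transpose_matvec(a, old_scores)
    # Offset scores are non-zero only around the modified edges.
    offset = [(1 - c) * (x - y) for x, y in zip(new_push, old_push)]
    # Propagate the offset across the updated graph, then merge.
    delta = propagate(b, offset, c, tol)
    return [r + d for r, d in zip(old_scores, delta)]


old_graph = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
new_graph = [[0, 1, 1, 1], [1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0]]  # adds edge 0 -> 3
old_scores = rwr(old_graph, seed=0)
updated = update_rwr(old_graph, new_graph, old_scores)
recomputed = rwr(new_graph, seed=0)
```

Here `updated` and `recomputed` should agree up to the tolerance, which is the sense in which an offset-based update can be exact. Raising `tol` trades accuracy for fewer iterations, loosely mirroring the role of the error tolerance in OSP-T.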
TPA: Fast, Scalable, and Accurate Method for Approximate Random Walk with Restart on Billion Scale Graphs
Given a large graph, how can we determine similarity between nodes in a fast
and accurate way? Random walk with restart (RWR) is a popular measure for this
purpose and has been exploited in numerous data mining applications including
ranking, anomaly detection, link prediction, and community detection. However,
previous methods for computing exact RWR require prohibitive storage sizes and
computational costs, and alternative methods which avoid such costs by
computing approximate RWR have limited accuracy. In this paper, we propose TPA,
a fast, scalable, and highly accurate method for computing approximate RWR on
large graphs. TPA exploits two important properties in RWR: 1) nodes close to a
seed node are likely to be revisited in the following steps due to the block-wise
structure of many real-world graphs, and 2) RWR scores of nodes which reside
far from the seed node are proportional to their PageRank scores. Based on
these two properties, TPA divides the approximate RWR problem into two subproblems
called neighbor approximation and stranger approximation. In the neighbor
approximation, TPA estimates RWR scores of nodes close to the seed based on
scores of a few early steps from the seed. In the stranger approximation, TPA
estimates RWR scores for nodes far from the seed using their PageRank. The
stranger and neighbor approximations are conducted in the preprocessing phase
and the online phase, respectively. Through extensive experiments, we show that
TPA requires up to 3.5x less time with up to 40x less memory space than other
state-of-the-art methods for the preprocessing phase. In the online phase, TPA
computes approximate RWR up to 30x faster than existing methods while
maintaining high accuracy.
Comment: 12 pages, 10 figures
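The split between a seed-specific part and a PageRank-based part can be sketched in a few lines of Python. This is a deliberately simplified two-part approximation in the spirit of the abstract, not TPA itself: the first `near_steps` propagation steps from the seed are computed exactly, and the remaining probability mass, $(1-c)^{near\_steps}$, is spread in proportion to PageRank. All function names, the dense matrix representation, and the no-dangling-node assumption are choices made for this illustration.

```python
def row_normalize(adj):
    """Row-normalize a dense adjacency matrix (assumes no all-zero rows)."""
    return [[v / sum(row) for v in row] for row in adj]


def step(norm_adj, vec, c):
    """One propagation step: (1 - c) * A^T @ vec."""
    out = [0.0] * len(vec)
    for i, row in enumerate(norm_adj):
        for j, w in enumerate(row):
            out[j] += (1 - c) * w * vec[i]
    return out


def series_sum(norm_adj, start, c, steps=None, tol=1e-12):
    """Sum the propagation series that begins at `start`.

    With `steps` set, only the first `steps` terms are summed;
    otherwise terms are summed until their L1 norm falls below tol.
    """
    total = [0.0] * len(start)
    term = list(start)
    k = 0
    while True:
        if steps is not None and k >= steps:
            break
        if steps is None and sum(abs(v) for v in term) < tol:
            break
        total = [t + x for t, x in zip(total, term)]
        term = step(norm_adj, term, c)
        k += 1
    return total


def pagerank(norm_adj, c):
    """PageRank as RWR with a uniform restart vector (preprocessing)."""
    n = len(norm_adj)
    return series_sum(norm_adj, [c / n] * n, c)


def approx_rwr(norm_adj, seed, pr, c, near_steps):
    """Exact early steps from the seed plus a PageRank-shaped tail (online)."""
    start = [0.0] * len(norm_adj)
    start[seed] = c
    near = series_sum(norm_adj, start, c, steps=near_steps)
    tail_mass = (1 - c) ** near_steps  # mass not covered by the early steps
    return [x + tail_mass * p for x, p in zip(near, pr)]


graph = row_normalize([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
c, near_steps = 0.15, 5
pr = pagerank(graph, c)                      # computed once, offline
approx = approx_rwr(graph, 0, pr, c, near_steps)
seed_vec = [c, 0.0, 0.0, 0.0, 0.0]
exact = series_sum(graph, seed_vec, c)       # full iteration, for comparison
l1_error = sum(abs(a - e) for a, e in zip(approx, exact))
```

In this simplified form the L1 error cannot exceed $2(1-c)^{near\_steps}$, because both the true tail and its PageRank stand-in carry mass $(1-c)^{near\_steps}$. The PageRank vector is reused across seeds, which is why it belongs in a preprocessing phase.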
PLAZA 4.0 : an integrative resource for functional, evolutionary and comparative plant genomics
PLAZA (https://bioinformatics.psb.ugent.be/plaza) is a plant-oriented online resource for comparative, evolutionary and functional genomics. The PLAZA platform consists of multiple independent instances focusing on different plant clades, while also providing access to a consistent set of reference species. Each PLAZA instance contains structural and functional gene annotations, gene family data, phylogenetic trees and detailed gene colinearity information. A user-friendly web interface makes the necessary tools and visualizations accessible, specific to each data type. Here we present PLAZA 4.0, the latest iteration of the PLAZA framework. This version consists of two new instances (Dicots 4.0 and Monocots 4.0) providing a large increase in newly available species, and offers access to updated and newly implemented tools and visualizations, helping users with the ever-increasing demands for complex and in-depth analyses. The total number of species across both instances nearly doubles from 37 species in PLAZA 3.0 to 71 species in PLAZA 4.0, with a much broader coverage of crop species (e.g. wheat, palm oil) and species of evolutionary interest (e.g. spruce, Marchantia). The new PLAZA instances can also be accessed by a programming interface through a RESTful web service, thus allowing bioinformaticians to optimally leverage the power of the PLAZA platform.
AUC Optimisation and Collaborative Filtering
In recommendation systems, one is interested in the ranking of the predicted
items as opposed to other losses such as the mean squared error. Although a
variety of ways to evaluate rankings exist in the literature, here we focus on
the Area Under the ROC Curve (AUC) as it is widely used and has a strong
theoretical underpinning. In practical recommendation, only items at the top of
the ranked list are presented to the users. With this in mind, we propose a
class of objective functions over matrix factorisations which primarily
represent a smooth surrogate for the real AUC, and in a special case we show
how to prioritise the top of the list. The objectives are differentiable and
optimised through a carefully designed stochastic gradient-descent-based
algorithm which scales linearly with the size of the data. In the special case
of square loss we show how to improve computational complexity by leveraging
previously computed measures. To understand theoretically the underlying matrix
factorisation approaches we study both the consistency of the loss functions
with respect to AUC, and generalisation using Rademacher theory. The resulting
generalisation analysis gives strong motivation for the optimisation under
study. Finally, we provide computational results on the efficacy of the
proposed method using synthetic and real data.
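The relationship between the exact AUC and a smooth surrogate can be made concrete with a small Python sketch. It is an illustration only: the sigmoid surrogate and the `scale` parameter are assumptions chosen for simplicity, not necessarily the objective used in the paper, and the function names are hypothetical. The exact AUC counts correctly ordered (relevant, irrelevant) item pairs, which is a step function of the scores, while the surrogate replaces each step with a differentiable sigmoid of the score difference.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score_pairs(scores, labels):
    """Yield (relevant score, irrelevant score) for every such item pair."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    for p in positives:
        for q in negatives:
            yield p, q


def auc(scores, labels):
    """Exact AUC: fraction of pairs ranked correctly (ties count 0.5)."""
    total, count = 0.0, 0
    for p, q in score_pairs(scores, labels):
        total += 1.0 if p > q else 0.5 if p == q else 0.0
        count += 1
    return total / count


def smooth_auc(scores, labels, scale=1.0):
    """Differentiable surrogate: mean sigmoid of scaled score differences."""
    total, count = 0.0, 0
    for p, q in score_pairs(scores, labels):
        total += sigmoid(scale * (p - q))
        count += 1
    return total / count


# Predicted scores for one user's items; in a matrix factorisation model
# each score would be the dot product of a user vector and an item vector.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
exact = auc(scores, labels)
soft = smooth_auc(scores, labels)
sharp = smooth_auc(scores, labels, scale=100.0)
```

As `scale` grows, the surrogate approaches the exact AUC but its gradients vanish almost everywhere, which is one reason a smooth objective is convenient for gradient-based optimisation.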