Massively Parallel Single-Source SimRanks in Rounds
SimRank is one of the most fundamental measures that evaluate the structural
similarity between two nodes in a graph and has been applied in a plethora of
data management tasks. These tasks often involve single-source SimRank
computation that evaluates the SimRank values between a source node and all
other nodes. Due to its high computation complexity, single-source SimRank
computation for large graphs is notoriously challenging, and hence recent
studies resort to distributed processing. To our surprise, although SimRank has
been widely adopted for two decades, theoretical aspects of distributed
SimRanks with provable results have rarely been studied.
In this paper, we conduct a theoretical study on single-source SimRank
computation in the Massively Parallel Computation (MPC) model, which is the
standard theoretical framework for modeling distributed systems such as
MapReduce, Hadoop, and Spark. Existing distributed SimRank algorithms incur
either a high communication round complexity or a large per-machine space for a
graph of n nodes. We overcome this barrier. In particular, given a graph of n
nodes, for any query node and any constant error bound, we show that a small
number of communication rounds among the machines suffices to compute the
single-source SimRank values within the given absolute error, while each
machine only needs space sub-linear in n. To the best of our knowledge, this is
the first single-source SimRank algorithm in MPC that overcomes this
round-complexity barrier with provable result accuracy.
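As background, SimRank admits a Monte Carlo interpretation: s(u, v) = E[c^T], where T is the first step at which two reverse random walks started from u and v meet, and c is the decay factor. The sketch below is a plain single-machine estimator of this kind, not the paper's MPC algorithm; the graph representation, walk length, and sample count are illustrative assumptions.

```python
import random

def single_source_simrank(adj, query, c=0.6, num_walks=200, max_len=10):
    """Monte Carlo single-source SimRank estimate.

    s(u, v) = E[c^T], where T is the first step at which two independent
    reverse random walks started from u and v meet.
    adj maps each node to its list of in-neighbors.
    """
    def reverse_walk(start):
        # Follow in-neighbors for at most max_len steps.
        path = [start]
        for _ in range(max_len):
            nbrs = adj[path[-1]]
            if not nbrs:
                break
            path.append(random.choice(nbrs))
        return path

    scores = {}
    for v in adj:
        if v == query:
            scores[v] = 1.0  # s(u, u) = 1 by definition
            continue
        total = 0.0
        for _ in range(num_walks):
            wu, wv = reverse_walk(query), reverse_walk(v)
            for t in range(1, min(len(wu), len(wv))):
                if wu[t] == wv[t]:  # walks first meet at step t
                    total += c ** t
                    break
        scores[v] = total / num_walks
    return scores
```

The estimator's variance shrinks with `num_walks`; the MPC setting studied in the paper distributes exactly this kind of sampling work across machines.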
Learning-Based Approaches for Graph Problems: A Survey
Over the years, many graph problems, particularly NP-complete ones, have been
studied by a wide range of researchers. Famous examples include graph
colouring, the travelling salesman problem, and subgraph isomorphism. Most of
these problems are typically addressed by exact algorithms, approximation
algorithms, and heuristics, each of which has its own drawbacks.
Recent studies have employed learning-based frameworks such as machine learning
techniques in solving these problems, given that they are useful in discovering
new patterns in structured data that can be represented using graphs. This
research direction has successfully attracted a considerable amount of
attention. In this survey, we provide a systematic review, mainly of classic
graph problems for which learning-based approaches have been proposed. We give
an overview of each framework and provide analyses of its design and
performance. Some potential research questions are also suggested. Ultimately,
this survey offers clearer insight and can serve as a stepping stone for the
research community in studying problems in this field.
Comment: v1: 41 pages; v2: 40 pages
Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
LSM-trees are widely adopted as the storage backend of key-value stores.
However, optimizing the system performance under dynamic workloads has not been
sufficiently studied or evaluated in previous work. To fill the gap, we present
RusKey, a key-value store with the following new features: (1) RusKey is the
first attempt to orchestrate LSM-tree structures online to enable robust
performance under dynamic workloads; (2) RusKey is the first
study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3)
RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient
transition between different compaction policies -- the bottleneck of dynamic
key-value stores. We justify the superiority of the new design with theoretical
analysis; (4) RusKey requires no prior workload knowledge for system
adjustment, in contrast to state-of-the-art techniques. Experiments show that
RusKey exhibits strong performance robustness in diverse workloads, achieving
up to 4x better end-to-end performance than the RocksDB system under various
settings.
Comment: 25 pages, 13 figures
DMCS : Density Modularity based Community Search
Community Search, or finding a connected subgraph (known as a community)
containing the given query nodes in a social network, is a fundamental problem.
Most of the existing community search models only focus on the internal
cohesiveness of a community. However, a high-quality community often has high
modularity, which means dense connections inside communities and sparse
connections to nodes outside the community. In this paper, we conduct a
pioneering study on searching for a community with high modularity. We point
out that while modularity has been widely used in community detection (without
query nodes), surprisingly it has not been adopted for community search, and its
application in community search (related to query nodes) brings in new
challenges. We address these challenges by designing a new graph modularity
function named Density Modularity. To the best of our knowledge, this is the
first work on the community search problem using graph modularity. The
community search problem based on density modularity, termed DMCS, is to find a
community in a social network that contains all the query nodes and has high
density modularity. We prove that the DMCS problem is NP-hard. To solve DMCS
efficiently, we present new algorithms that run in time log-linear in the graph
size. We conduct extensive experimental studies on real-world and synthetic
networks, which offer insights into the efficiency and effectiveness of our
algorithms. In particular, our algorithm achieves up to 8.5 times higher
accuracy in terms of NMI than baseline algorithms.
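For reference, the classic (Newman) modularity contribution of a single community S in a graph with m edges is Q(S) = e_S/m - (d_S/2m)^2, where e_S counts intra-community edges and d_S is the total degree of S's members. The snippet below implements only this classic background formula, not the density modularity function the paper introduces; the edge-list representation is an assumption.

```python
def newman_modularity(edges, community):
    """Classic Newman modularity contribution of one community S:
        Q(S) = e_S / m - (d_S / (2 m))**2
    where m is the total edge count, e_S the number of edges with both
    endpoints in S, and d_S the total degree of S's members.
    """
    m = len(edges)
    S = set(community)
    # Edges fully inside the community.
    e_in = sum(1 for u, v in edges if u in S and v in S)
    # Degree of every node, from the undirected edge list.
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    d_S = sum(deg[u] for u in S if u in deg)
    return e_in / m - (d_S / (2 * m)) ** 2
```

A community scores well when its internal edge fraction exceeds what its degree share would predict under random wiring, which is the intuition the abstract appeals to.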
SCARA: Scalable Graph Neural Networks with Feature-Oriented Optimization
Recent advances in data processing have stimulated the demand for learning on
graphs of very large scale. Graph Neural Networks (GNNs), being an emerging
and powerful approach in solving graph learning tasks, are known to be
difficult to scale up. Most scalable models apply node-based techniques in
simplifying the expensive graph message-passing propagation procedure of GNN.
However, we find such acceleration insufficient when applied to million- or
even billion-scale graphs. In this work, we propose SCARA, a scalable GNN with
feature-oriented optimization for graph computation. SCARA efficiently computes
graph embedding from node features, and further selects and reuses feature
computation results to reduce overhead. Theoretical analysis indicates that our
model achieves sub-linear time complexity with guaranteed precision in the
propagation process as well as in GNN training and inference. We conduct extensive
experiments on various datasets to evaluate the efficacy and efficiency of
SCARA. Performance comparisons with baselines show that SCARA achieves up to
100x faster graph propagation than current state-of-the-art methods, with fast
convergence and comparable accuracy. Most notably, it completes the
precomputation on the largest available billion-scale GNN dataset, Papers100M
(111M nodes, 1.6B edges), in 100 seconds.
Label Propagation for Graph Label Noise
Label noise is a common challenge in large datasets, as it can significantly
degrade the generalization ability of deep neural networks. Most existing
studies focus on noisy labels in computer vision; however, graph models take
both node features and graph topology as input, and are more susceptible to
label noise through message-passing mechanisms. Recently, only a few works have
been proposed to tackle label noise on graphs. One major
limitation is that they assume the graph is homophilous and the labels are
smoothly distributed. Nevertheless, real-world graphs may contain varying
degrees of heterophily or even be heterophily-dominated, leading to the
inadequacy of current methods. In this paper, we study graph label noise in the
context of arbitrary heterophily, with the aim of rectifying noisy labels and
assigning labels to previously unlabeled nodes. We begin by conducting two
empirical analyses to explore the impact of graph homophily on graph label
noise. Following these observations, we propose a simple yet efficient algorithm,
denoted as LP4GLN. Specifically, LP4GLN is an iterative algorithm with three
steps: (1) reconstruct the graph to recover the homophily property, (2) utilize
label propagation to rectify the noisy labels, (3) select high-confidence
labels to retain for the next iteration. By iterating these steps, we obtain a
set of correct labels, ultimately achieving high accuracy in the node
classification task. A theoretical analysis is also provided to demonstrate the
algorithm's remarkable denoising effect. Finally, we conduct experiments on 10
benchmark datasets under varying graph heterophily levels and noise types,
comparing the performance of LP4GLN with 7 typical baselines. Our results
illustrate the superior performance of the proposed LP4GLN.
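Step (2) of the iteration above typically builds on standard label propagation. A minimal dense-matrix sketch, assuming the symmetrically normalized variant of Zhou et al. (LP4GLN's exact update rule may differ), is:

```python
import numpy as np

def label_propagation(A, Y, alpha=0.9, iters=50):
    """Label propagation with symmetric normalization (Zhou et al. style):
        F <- alpha * S @ F + (1 - alpha) * Y,   S = D^{-1/2} A D^{-1/2}
    A: (n, n) adjacency matrix; Y: (n, k) one-hot labels, with all-zero
    rows for unlabeled nodes. Returns the soft label matrix F.
    """
    d = A.sum(axis=1)
    d[d == 0] = 1.0                      # guard against isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A @ D_inv_sqrt      # normalized propagation matrix
    F = Y.astype(float).copy()
    for _ in range(iters):
        # Spread neighbor labels, anchored to the observed labels Y.
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F
```

Noisy or missing entries of Y get overruled by the propagated neighborhood consensus, which is the rectification effect the abstract relies on; the homophily-recovering reconstruction in step (1) is what makes this consensus trustworthy on heterophilous graphs.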