Massively Parallel Single-Source SimRanks in Rounds
SimRank is one of the most fundamental measures for evaluating the structural
similarity between two nodes in a graph, and it has been applied in a plethora
of data management tasks. These tasks often involve single-source SimRank
computation, which evaluates the SimRank values between a source node and all
other nodes. Due to its high computational complexity, single-source SimRank
computation for large graphs is notoriously challenging, and hence recent
studies resort to distributed processing. Surprisingly, although SimRank has
been widely adopted for two decades, the theoretical aspects of distributed
SimRank computation with provable results have rarely been studied.
In this paper, we conduct a theoretical study of single-source SimRank
computation in the Massively Parallel Computation (MPC) model, the standard
theoretical framework for modeling distributed systems such as MapReduce,
Hadoop, and Spark. Existing distributed SimRank algorithms incur either a high
communication round complexity or a large per-machine space requirement for a
graph of n nodes. We overcome this barrier. In particular, given a graph of n
nodes, for any query node and any constant error bound, we show that a small
number of rounds of communication among the machines is enough to compute
single-source SimRank values within the given absolute error, while each
machine only needs space sub-linear in n. To the best of our knowledge, this is
the first single-source SimRank algorithm in MPC that can overcome the
round-complexity barrier with provable result accuracy.
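The Monte-Carlo view of single-source SimRank that such methods build on can be sketched in a few lines: s(u, v) is the expected value of c^t, where t is the first step at which two reverse random walks started at u and v meet. The toy graph, decay factor, and sample counts below are illustrative choices, not the paper's algorithm or parameters.

```python
import random

# Toy directed graph as in-neighbour lists (hypothetical example).
IN = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1],
    3: [2],
}

C = 0.6        # SimRank decay factor
WALK_LEN = 10  # truncation depth of each reverse walk
R = 2000       # Monte-Carlo walk pairs per target node

def reverse_walk(v, length):
    """One backward random walk of up to `length` steps; stops at sources."""
    path = [v]
    for _ in range(length):
        preds = IN.get(path[-1], [])
        if not preds:
            break
        path.append(random.choice(preds))
    return path

def single_source_simrank(u, rng_seed=7):
    """Estimate s(u, v) for every v as the average of C**t over walk pairs,
    where t is the first step at which the two reverse walks coincide."""
    random.seed(rng_seed)
    sims = {}
    for v in IN:
        if v == u:
            sims[v] = 1.0  # a node is maximally similar to itself
            continue
        total = 0.0
        for _ in range(R):
            pu = reverse_walk(u, WALK_LEN)
            pv = reverse_walk(v, WALK_LEN)
            for t in range(1, min(len(pu), len(pv))):
                if pu[t] == pv[t]:   # first meeting at step t
                    total += C ** t
                    break
        sims[v] = total / R
    return sims

sims = single_source_simrank(0)
```

Distributed variants parallelise exactly this sampling step, with the round and space budgets governing how many walks each machine can extend per round.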
Learning-Based Approaches for Graph Problems: A Survey
Over the years, many graph problems, in particular NP-complete ones, have been
studied by a wide range of researchers. Famous examples include graph
colouring, the travelling salesman problem, and subgraph isomorphism. Most of
these problems are typically addressed by exact algorithms, approximation
algorithms, and heuristics, each of which has its drawbacks. Recent studies
have employed learning-based frameworks, such as machine learning techniques,
to solve these problems, given that they are useful in discovering new patterns
in structured data that can be represented using graphs. This research
direction has attracted considerable attention. In this survey, we provide a
systematic review, mainly of classic graph problems for which learning-based
approaches have been proposed. We discuss an overview of each framework and
provide analyses based on its design and performance. Some potential research
questions are also suggested. Ultimately, this survey gives a clearer insight
and can serve as a stepping stone for the research community in studying
problems in this field.
Comment: v1: 41 pages; v2: 40 pages
Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
LSM-trees are widely adopted as the storage backend of key-value stores.
However, optimizing the system performance under dynamic workloads has not been
sufficiently studied or evaluated in previous work. To fill the gap, we present
RusKey, a key-value store with the following new features: (1) RusKey is the
first attempt to orchestrate LSM-tree structures online to enable robust
performance under dynamic workloads; (2) RusKey is the first
study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3)
RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient
transition between different compaction policies -- the bottleneck of dynamic
key-value stores. We justify the superiority of the new design with theoretical
analysis; (4) RusKey requires no prior workload knowledge for system
adjustment, in contrast to state-of-the-art techniques. Experiments show that
RusKey exhibits strong performance robustness in diverse workloads, achieving
up to 4x better end-to-end performance than the RocksDB system under various
settings.
Comment: 25 pages, 13 figures
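RusKey's actual RL formulation is not reproduced here; as a loose illustration of RL-guided policy selection, the sketch below uses a simple epsilon-greedy bandit that picks between two hypothetical compaction policies ("leveling" and "tiering") based on observed latencies. All names and numbers are invented for the example.

```python
import random

# Hypothetical mean request latencies (ms) of each compaction policy
# under the current workload; a real system would measure these online.
TRUE_LATENCY = {"leveling": 1.8, "tiering": 1.2}

EPSILON = 0.1  # exploration rate
random.seed(1)

counts = {p: 0 for p in TRUE_LATENCY}
avg_reward = {p: 0.0 for p in TRUE_LATENCY}  # reward = negative latency

def observe_latency(policy):
    """Noisy latency measurement for the chosen compaction policy."""
    return TRUE_LATENCY[policy] + random.gauss(0.0, 0.2)

def choose_policy():
    """Epsilon-greedy: mostly exploit the best policy seen so far."""
    if random.random() < EPSILON:
        return random.choice(list(TRUE_LATENCY))
    return max(avg_reward, key=avg_reward.get)

for _ in range(500):
    p = choose_policy()
    r = -observe_latency(p)
    counts[p] += 1
    avg_reward[p] += (r - avg_reward[p]) / counts[p]  # incremental mean

best = max(avg_reward, key=avg_reward.get)
```

The point of the sketch is the feedback loop (act, observe latency, update value estimates), which requires no prior workload knowledge; transformation cost between policies, which the FLSM-tree design targets, is what a bandit like this cannot capture.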
ESSAYS ON SOVEREIGN DEFAULT AND HOUSEHOLD PORTFOLIO CHOICE
This dissertation analyzes portfolio choice problems in different contexts. In the first chapter, “Nominal Exchange Rate Volatility, Default Risk and Reserve Accumulation,” I investigate how nominal exchange rate volatility affects a sovereign's portfolio choice over how much debt to issue and how many reserves to accumulate. First, I document a positive correlation between nominal exchange rate volatility and sovereign default risk and show that this relationship becomes stronger when more of the external debt is denominated in foreign currency. Then, I build a sovereign default model to rationalize these findings and to quantify the channels that contribute to the large reserve holdings among emerging countries.
In the second chapter, “Household Portfolio Accounting,” we document and analyze the substantial heterogeneity in household portfolio composition in the United States. We consider a standard life-cycle model with labor income risk and portfolio choice, augmented with a savings wedge that lowers the return on saving and a risky wedge that lowers the relative return on risky assets. Using U.S. survey data (2004-2016), we compute the household-level wedges that rationalize the data. The chapter has two main contributions: first, it uses the wedges to identify plausible frictions that researchers should consider in their models; second, it analyzes the extent to which household characteristics can account for the wedges.
DMCS : Density Modularity based Community Search
Community Search, or finding a connected subgraph (known as a community)
containing the given query nodes in a social network, is a fundamental problem.
Most of the existing community search models only focus on the internal
cohesiveness of a community. However, a high-quality community often has high
modularity, which means dense connections inside communities and sparse
connections to the nodes outside the community. In this paper, we conduct a
pioneering study on searching for a community with high modularity. We point
out that while modularity has been widely used in community detection (without
query nodes), surprisingly, it has not been adopted for community search, and
its application in community search (which involves query nodes) brings new
challenges. We address these challenges by designing a new graph modularity
function named Density Modularity. To the best of our knowledge, this is the
first work on the community search problem using graph modularity. The
community search based on density modularity, termed DMCS, is to find a
community in a social network that contains all the query nodes and has high
density modularity. We prove that the DMCS problem is NP-hard. To efficiently
address DMCS, we present new algorithms that run in time log-linear in the
graph size. We conduct extensive experimental studies on real-world and
synthetic networks, which offer insights into the efficiency and effectiveness
of our algorithms. In particular, our algorithm achieves up to 8.5 times higher
accuracy in terms of NMI than baseline algorithms.
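The paper's density modularity function is defined in the paper itself; as a rough stand-in, the sketch below scores a single candidate community with the classic modularity contribution Q(S) = e_in/m - (vol(S)/2m)^2, which captures the same dense-inside/sparse-outside intuition that the abstract describes. The toy graph is hypothetical.

```python
# Undirected toy graph as an edge list (hypothetical example):
# a triangle {0,1,2} loosely attached to a triangle {3,4,5}.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

def community_modularity(edges, community):
    """Classic modularity contribution Q(S) = e_in/m - (vol(S)/2m)^2 of a
    single node set S -- a stand-in for the paper's density modularity."""
    S = set(community)
    m = len(edges)
    e_in = sum(1 for u, v in edges if u in S and v in S)
    vol = sum((u in S) + (v in S) for u, v in edges)  # total degree inside S
    return e_in / m - (vol / (2 * m)) ** 2

q_good = community_modularity(EDGES, [0, 1, 2])            # dense triangle
q_all = community_modularity(EDGES, [0, 1, 2, 3, 4, 5])    # whole graph: 0
```

A community search would maximise a score of this kind over connected subgraphs containing the query nodes; the whole graph scores exactly zero, so the objective rewards genuinely separated dense groups.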
SCARA: Scalable Graph Neural Networks with Feature-Oriented Optimization
Recent advances in data processing have stimulated the demand for learning on
graphs of very large scale. Graph Neural Networks (GNNs), an emerging and
powerful approach to graph learning tasks, are known to be difficult to scale
up. Most scalable models apply node-based techniques to simplify the expensive
graph message-passing propagation procedure of GNNs.
However, we find such acceleration insufficient when applied to million- or
even billion-scale graphs. In this work, we propose SCARA, a scalable GNN with
feature-oriented optimization for graph computation. SCARA efficiently computes
graph embedding from node features, and further selects and reuses feature
computation results to reduce overhead. Theoretical analysis indicates that our
model achieves sub-linear time complexity with guaranteed precision in the
propagation process as well as in GNN training and inference. We conduct extensive
experiments on various datasets to evaluate the efficacy and efficiency of
SCARA. Comparison with baselines shows that SCARA achieves up to 100x faster
graph propagation than current state-of-the-art methods, with fast convergence
and comparable accuracy. Most notably, SCARA efficiently completes
precomputation on the largest available billion-scale GNN dataset, Papers100M
(111M nodes, 1.6B edges), in 100 seconds.
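The decoupled propagation that scalable GNNs of this kind precompute can be sketched as Personalized-PageRank-style feature smoothing, z = sum_t alpha*(1-alpha)^t * P^t x, truncated at depth T. The graph, scalar feature, and parameters below are illustrative; SCARA's actual feature-push algorithm and error control are in the paper.

```python
# Toy undirected graph as adjacency lists (hypothetical example).
ADJ = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
X = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}  # one scalar feature per node

ALPHA = 0.15  # teleport probability
T = 20        # truncation depth

def propagate(adj, x, alpha, steps):
    """Truncated PPR smoothing z = sum_t alpha*(1-alpha)^t * P^t x,
    where P is the row-normalised (random-walk) adjacency matrix."""
    z = {v: alpha * x[v] for v in adj}   # t = 0 term
    cur = dict(x)
    for t in range(1, steps + 1):
        nxt = {v: 0.0 for v in adj}
        for v, nbrs in adj.items():
            share = cur[v] / len(nbrs)
            for u in nbrs:
                nxt[u] += share          # push mass along each edge
        cur = nxt
        w = alpha * (1 - alpha) ** t
        for v in adj:
            z[v] += w * cur[v]
    return z

z = propagate(ADJ, X, ALPHA, T)
```

Because this smoothing depends only on the graph and raw features, it can be precomputed once and reused across training and inference, which is the structural reason feature-oriented optimisation pays off at billion scale.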
Unsupervised detection of botnet activities using frequent pattern tree mining
A botnet is a network of remotely-controlled infected computers that can send spam, spread viruses, or stage denial-of-service attacks without the consent of the computer owners. Since the beginning of the 21st century, botnet activities have steadily increased, becoming one of the major concerns for Internet security. In fact, botnet activities are becoming more and more difficult to detect, because they make use of Peer-to-Peer protocols (eMule, Torrent, Frostwire, Vuze, Skype and many others). To improve the detectability of botnet activities, this paper introduces the idea of association analysis from the field of data mining, and proposes a system to detect botnets based on the FP-growth (Frequent Pattern Tree) frequent item mining algorithm. The detection system is composed of three parts: packet collection and processing, rule mining, and statistical analysis of rules. Its characteristic feature is the rule-based classification of different botnet behaviors in a fast and unsupervised fashion. The effectiveness of the approach is validated in a scenario with 11 Peer-to-Peer host PCs, 42063 non-Peer-to-Peer host PCs, and 17 host PCs with three different botnet activities (Storm, Waledac and Zeus). The recognition accuracy of the proposed architecture is shown to be above 94%. The proposed method is shown to improve upon the results reported in the literature.
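FP-growth itself builds a compact prefix tree to avoid candidate generation; the sketch below instead uses a brute-force Apriori-style enumeration, which yields the same frequent itemsets on small inputs and illustrates the rule-mining stage of such a detector. The per-host "transactions" of traffic features are invented for the example.

```python
from itertools import combinations

# Hypothetical transactions: sets of traffic features observed per host.
TRANSACTIONS = [
    {"udp", "high_fanout", "dns_burst"},
    {"udp", "high_fanout", "dns_burst"},
    {"udp", "high_fanout"},
    {"tcp", "http"},
    {"tcp", "http", "dns_burst"},
]

MIN_SUPPORT = 3  # minimum number of transactions containing the itemset

def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets meeting min_support (brute force; FP-growth
    reaches the same result without explicit candidate enumeration)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                frequent[cand] = support
                found = True
        if not found:  # Apriori property: no larger itemset can qualify
            break
    return frequent

freq = frequent_itemsets(TRANSACTIONS, MIN_SUPPORT)
```

Frequent itemsets such as ("high_fanout", "udp") become candidate behaviour rules; the statistical-analysis stage then decides which rules separate botnet hosts from benign Peer-to-Peer traffic.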