HYPE: Massive Hypergraph Partitioning with Neighborhood Expansion
Many important real-world applications, such as social networks or distributed
databases, can be modeled as hypergraphs. In such a model, vertices represent
entities, such as users or data records, whereas hyperedges model group
memberships of the vertices, such as the authorship in a specific topic or the
membership of a data record in a specific replicated shard. To optimize such
applications, we need an efficient and effective solution to the NP-hard
balanced k-way hypergraph partitioning problem. However, existing hypergraph
partitioners that scale to very large graphs do not effectively exploit the
hypergraph structure when performing the partitioning decisions. We propose
HYPE, a hypergraph partitioner that exploits the neighborhood relations
between vertices in the hypergraph using an efficient implementation of
neighborhood expansion. HYPE improves partitioning quality by up to 95% and
reduces runtime by up to 39% compared to streaming partitioning.
Comment: To appear in Proceedings of IEEE 2018 International Conference on Big Data (BigData '18), 10 pages.
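The core idea, growing each partition outward from a seed along shared hyperedges, can be sketched in a few lines of Python. This is a toy under our own assumptions (arbitrary fringe picks, a naive balance cap), not HYPE's actual fringe scoring or implementation; the hypergraph is assumed given as a list of vertex sets.

```python
from collections import defaultdict

def neighborhood_expansion(hyperedges, k):
    """Grow each of k partitions by repeatedly pulling vertices from the
    fringe (neighbors, via shared hyperedges, of already-assigned vertices)."""
    incident = defaultdict(list)                # vertex -> incident hyperedge ids
    for e, verts in enumerate(hyperedges):
        for v in verts:
            incident[v].append(e)

    unassigned = set(incident)
    target = len(unassigned) // k + 1           # naive balance cap
    parts = []
    for _ in range(k):
        part, fringe = set(), set()
        if unassigned:
            fringe.add(next(iter(unassigned)))  # arbitrary seed vertex
        while fringe and len(part) < target:
            v = fringe.pop()                    # HYPE ranks fringe vertices; we pick arbitrarily
            part.add(v)
            unassigned.discard(v)
            for e in incident[v]:               # neighborhood expansion step
                fringe.update(u for u in hyperedges[e] if u in unassigned)
        parts.append(part)
    for v in unassigned:                        # leftovers, e.g. disconnected vertices
        min(parts, key=len).add(v)
    return parts

# Two communities bridged by one hyperedge.
h = [{1, 2, 3}, {2, 3, 4}, {5, 6, 7}, {6, 7, 8}, {4, 5}]
print(neighborhood_expansion(h, 2))
```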
Graph Partitioning via Parallel Submodular Approximation to Accelerate Distributed Machine Learning
Distributed computing excels at processing large scale data, but the
communication cost for synchronizing the shared parameters may slow down the
overall performance. Fortunately, the interactions between parameter and data
in many problems are sparse, which admits efficient partition in order to
reduce the communication overhead.
In this paper, we formulate data placement as a graph partitioning problem.
We propose a distributed partitioning algorithm with both theoretical
guarantees and a highly efficient implementation, and demonstrate its promising results
on both text datasets and social networks. We show that the proposed algorithm
leads to a 1.6x speedup of a state-of-the-art distributed machine learning
system by eliminating 90% of the network communication.
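To make the data-placement objective concrete, here is a hedged sketch: each sample is the set of parameter ids it touches, and it goes to the worker where it creates the fewest new parameter replicas, which is what drives communication. The paper's method is a parallel submodular approximation with guarantees; this serial greedy pass and all names are only illustrative.

```python
def greedy_placement(samples, num_workers):
    """Assign each sample (a set of parameter ids) to the worker where it
    introduces the fewest new parameter replicas, under a simple load cap."""
    cap = len(samples) // num_workers + 1
    params_on = [set() for _ in range(num_workers)]   # parameters replicated per worker
    load = [0] * num_workers
    assignment = []
    for touched in samples:
        best, best_cost = None, None
        for w in range(num_workers):
            if load[w] >= cap:
                continue                              # keep data balanced
            cost = len(touched - params_on[w])        # new replicas this choice creates
            if best_cost is None or cost < best_cost:
                best, best_cost = w, cost
        params_on[best] |= touched
        load[best] += 1
        assignment.append(best)
    # Total replicas is a proxy for synchronization traffic.
    return assignment, sum(len(p) for p in params_on)

# Documents touching sparse vocabulary features cluster naturally.
docs = [{"a", "b"}, {"a", "c"}, {"d", "e"}, {"d", "f"}]
print(greedy_placement(docs, 2))
```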
Physics-Aware Neural Networks for Distribution System State Estimation
The distribution system state estimation problem seeks to determine the
network state from available measurements. Widely used Gauss-Newton approaches
are very sensitive to the initialization and often not suitable for real-time
estimation. Learning approaches are very promising for real-time estimation, as
they shift the computational burden to an offline training stage. Prior machine
learning approaches to power system state estimation have been electrical
model-agnostic, in that they did not exploit the topology and physical laws
governing the power grid to design the architecture of the learning model. In
this paper, we propose a novel learning model that utilizes the structure of
the power grid. The proposed neural network architecture reduces the number of
coefficients needed to parameterize the mapping from the measurements to the
network state by exploiting the separability of the estimation problem. This
prevents overfitting and reduces the complexity of the training stage. We also
propose a greedy algorithm for phasor measurement unit placement that aims at
minimizing the complexity of the neural network required for realizing the
state estimation mapping. Simulation results show superior performance of the
proposed method over the Gauss-Newton approach.
Comment: 8 pages, 5 figures, 3 tables.
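As a loose illustration of topology-aware parameterization (not the paper's exact architecture, which additionally exploits the separability of the estimation problem), one can constrain a layer's weights to the grid's sparsity pattern so each bus state is inferred only from measurements at electrically nearby buses. The toy feeder and all variable names below are assumptions.

```python
import numpy as np

# Toy 4-bus radial feeder: bus i is connected to bus i+1.
A = np.eye(4) + np.eye(4, k=1) + np.eye(4, k=-1)

rng = np.random.default_rng(0)
mask = (A > 0).astype(float)             # sparsity pattern follows the grid topology
W = rng.standard_normal(A.shape) * mask  # far fewer free coefficients than a dense layer

z = rng.standard_normal(4)               # one (noisy) measurement per bus
x_hat = np.tanh(W @ z)                   # each bus state depends only on nearby measurements
print(x_hat)
```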
Social Hash Partitioner: A Scalable Distributed Hypergraph Partitioner
We design and implement a distributed algorithm for balanced k-way
hypergraph partitioning that minimizes fanout, a fundamental hypergraph
quantity also known as the communication volume and (λ - 1)-cut metric, by
optimizing a novel objective called probabilistic fanout. This choice allows a
simple local search heuristic to achieve comparable solution quality to the
best existing hypergraph partitioners.
Our algorithm is arbitrarily scalable due to a careful design that controls
computational complexity, space complexity, and communication. In practice, we
commonly process hypergraphs with billions of vertices and hyperedges in a few
hours. We explain how the algorithm's scalability, both in terms of hypergraph
size and bucket count, is limited only by the number of machines available. We
perform an extensive comparison to existing distributed hypergraph partitioners
and find that our approach is able to optimize substantially bigger
hypergraphs on the same set of machines.
We call the resulting tool Social Hash Partitioner (SHP), and accompanying
this paper, we open-source the most scalable version based on recursive
bisection.
Comment: Proceedings of the VLDB Endowment, 2017.
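The spirit of the probabilistic-fanout objective can be sketched in a few lines: treat each vertex of a hyperedge as present independently with probability p and charge every bucket the probability that it holds at least one present vertex; hard fanout is recovered as p approaches 1. The exact definition and data layout in SHP may differ; this sketch only shows why local-search move gains become cheap, per-hyperedge computations.

```python
def prob_fanout(counts, p=0.5):
    """Smoothed fanout of one hyperedge. counts[b] = number of the
    hyperedge's vertices currently in bucket b."""
    return sum(1.0 - (1.0 - p) ** n for n in counts.values() if n > 0)

def move_gain(edge_counts, vertex_edges, src, dst, p=0.5):
    """Improvement in smoothed fanout if a vertex moves from bucket src
    to dst, summed over its incident hyperedges."""
    gain = 0.0
    for e in vertex_edges:
        counts = edge_counts[e]
        before = prob_fanout(counts, p)
        counts[src] -= 1
        counts[dst] = counts.get(dst, 0) + 1
        gain += before - prob_fanout(counts, p)
        counts[dst] -= 1                 # roll back; caller applies the best move
        counts[src] += 1
    return gain

# One hyperedge with two vertices in bucket 0 and one in bucket 1:
# moving the lone vertex into bucket 0 reduces the smoothed fanout.
edge_counts = {0: {0: 2, 1: 1}}
print(move_gain(edge_counts, vertex_edges=[0], src=1, dst=0))
```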
DRONE: a Distributed Subgraph-Centric Framework for Processing Large Scale Power-law Graphs
In the big data era, applications such as social networks, graph databases,
knowledge graphs, and electronic commerce demand efficient and scalable
capabilities for processing an ever-increasing volume of graph-structured data.
To meet the challenge, two mainstream distributed programming models,
vertex-centric (VC) and subgraph-centric (SC), were proposed. Compared to the
VC model, the SC model converges faster with less communication overhead on
well-partitioned graphs, and is easy to program due to the "think like a graph"
philosophy. The edge-cut method is considered the natural partitioning choice
for the subgraph-centric model and has been adopted by Giraph++, Blogel, and
GRAPE. However, edge-cut partitioning causes a significant performance
bottleneck when processing large-scale power-law graphs, leaving the SC model
less competitive in practice. In
this paper, we present an innovative distributed graph computing framework,
DRONE (Distributed gRaph cOmputiNg Engine). It combines the subgraph-centric
model and the vertex-cut graph partitioning strategy. Experiments show that
DRONE outperforms the state-of-the-art distributed graph computing engines on
real-world graphs and synthetic power-law graphs. DRONE is capable of scaling
up to process one-trillion-edge synthetic power-law graphs, which is orders of
magnitude larger than previously reported by existing SC-based frameworks.
Comment: 13 pages, 9 figures.
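DRONE pairs the SC model with vertex-cut partitioning, which assigns edges (rather than vertices) to partitions and replicates high-degree vertices. Below is a generic greedy sketch of vertex-cut assignment, under our own load-cap and tie-breaking assumptions rather than DRONE's actual strategy.

```python
from collections import defaultdict

def greedy_vertex_cut(edges, k):
    """Assign each edge to one of k partitions, preferring partitions that
    already host a copy of an endpoint; under a load cap, hub vertices end
    up replicated (vertex-cut) instead of edges being cut."""
    cap = len(edges) // k + 1
    mirrors = defaultdict(set)          # vertex -> partitions holding a copy
    load = [0] * k
    placement = {}
    for u, v in edges:
        preferred = (mirrors[u] & mirrors[v]) or (mirrors[u] | mirrors[v]) \
            or set(range(k))
        eligible = {p for p in preferred if load[p] < cap} \
            or {p for p in range(k) if load[p] < cap}
        p = min(eligible, key=lambda i: load[i])   # break ties toward balance
        placement[(u, v)] = p
        mirrors[u].add(p)
        mirrors[v].add(p)
        load[p] += 1
    avg_replication = sum(len(s) for s in mirrors.values()) / len(mirrors)
    return placement, avg_replication

star = [(0, i) for i in range(1, 9)]    # a small power-law-style hub
print(greedy_vertex_cut(star, k=2))     # the hub vertex 0 gets mirrored
```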
Device Placement Optimization with Reinforcement Learning
The past few years have witnessed a growth in size and computational
requirements for training and inference with neural networks. Currently, a
common approach to address these requirements is to use a heterogeneous
distributed environment with a mixture of hardware devices such as CPUs and
GPUs. Importantly, the decision of placing parts of the neural models on
devices is often made by human experts based on simple heuristics and
intuitions. In this paper, we propose a method which learns to optimize device
placement for TensorFlow computational graphs. Key to our method is the use of
a sequence-to-sequence model to predict which subsets of operations in a
TensorFlow graph should run on which of the available devices. The execution
time of the predicted placements is then used as the reward signal to optimize
the parameters of the sequence-to-sequence model. Our main result is that on
Inception-V3 for ImageNet classification, and on RNN LSTM, for language
modeling and neural machine translation, our model finds non-trivial device
placements that outperform hand-crafted heuristics and traditional algorithmic
methods.
Comment: To appear at ICML 2017.
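A minimal sketch of the training loop described above, with two deliberate simplifications: independent per-operation logits stand in for the paper's sequence-to-sequence policy, and a toy cost function stands in for executing and timing the placed TensorFlow graph. Everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_OPS, NUM_DEVICES = 6, 2
logits = np.zeros((NUM_OPS, NUM_DEVICES))   # stand-in for the seq2seq policy

def sample_placement(logits):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    placement = np.array([rng.choice(NUM_DEVICES, p=row) for row in probs])
    return placement, probs

def toy_runtime(placement):
    # Stand-in for executing the placed graph: a base cost plus a penalty
    # for every cross-device boundary between consecutive operations.
    return 1.0 + 0.1 * (np.diff(placement) != 0).sum()

baseline, lr = None, 0.5
for _ in range(200):
    placement, probs = sample_placement(logits)
    reward = -toy_runtime(placement)          # reward = negative measured runtime
    baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward
    advantage = reward - baseline
    for i, d in enumerate(placement):         # REINFORCE: grad log pi = onehot - probs
        grad = -probs[i]
        grad[d] += 1.0
        logits[i] += lr * advantage * grad

print(sample_placement(logits)[0])            # learned placements co-locate ops
```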
Search-based Tier Assignment for Optimising Offline Availability in Multi-tier Web Applications
Web programmers are often faced with several challenges in the development
process of modern, rich internet applications. Technologies for the different
tiers of the application have to be selected: a server-side language, a
combination of JavaScript, HTML and CSS for the client, and a database
technology. Meeting the expectations of contemporary web applications requires
even more effort from the developer: many state-of-the-art libraries must be
mastered and glued together. This leads to an impedance mismatch problem
between the different technologies and it is up to the programmer to align them
manually. Multi-tier or tierless programming is a web programming paradigm that
provides one language for the different tiers of the web application, allowing
the programmer to focus on the actual program logic instead of the accidental
complexity that comes from combining several technologies. While current
tierless approaches therefore relieve the burden of having to combine different
technologies into one application, the distribution of the code is explicitly
tied into the program. Certain distribution decisions have an impact on
crosscutting concerns such as information security or offline availability.
Moreover, adapting the programs such that the application complies better with
these concerns often leads to code tangling, rendering the program more
difficult to understand and maintain. We introduce an approach to multi-tier
programming where the tierless code is decoupled from the tier specification.
The developer implements the web application in terms of slices and an external
specification that assigns the slices to tiers. A recommender system completes
the picture for those slices that do not have a fixed placement and proposes
slice refinements as well. This recommender system tries to optimise the tier
specification with respect to one or more crosscutting concerns. This is in
contrast with current cutting-edge solutions that hide distribution decisions
from the programmer. In this paper we show that slices, together with a
recommender system, enable the developer to experiment with different
placements of slices, until the distribution of the code satisfies the
programmer's needs. We present a search-based recommender system that maximises
the offline availability of a web application and a concrete implementation of
these concepts in the tier-splitting tool Stip.js.
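To make the slice/tier-specification split concrete, here is a toy version: an exhaustive search stands in for the paper's recommender, offline availability is scored as "a slice and everything it depends on sits on the client tier", and all slice names, the dependency model, and the fixed placement are invented for illustration.

```python
import itertools

TIERS = ("client", "server")

def offline_available(assignment, deps):
    """A slice works offline iff it and all slices it depends on are on the client."""
    def on_client(s):
        return assignment[s] == "client" and all(on_client(d) for d in deps.get(s, ()))
    return sum(on_client(s) for s in assignment)

def recommend(slices, deps, fixed):
    """Exhaustive stand-in for the search-based recommender: score every
    tier assignment of the free slices and keep the most available one."""
    free = [s for s in slices if s not in fixed]
    best, best_score = None, -1
    for combo in itertools.product(TIERS, repeat=len(free)):
        assignment = {**fixed, **dict(zip(free, combo))}
        score = offline_available(assignment, deps)
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score

slices = ["render", "cache", "query", "auth"]
deps = {"render": ["cache"], "cache": ["query"]}   # render needs cache needs query
fixed = {"auth": "server"}                          # e.g., security pins auth server-side
print(recommend(slices, deps, fixed))
```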
Accelerated Training for CNN Distributed Deep Learning through Automatic Resource-Aware Layer Placement
The Convolutional Neural Network (CNN) model, often used for image
classification, requires significant training time to obtain high accuracy. To
this end, distributed training is performed with the parameter server (PS)
architecture using multiple servers. Unfortunately, scalability has been found
to be poor in existing architectures. We find that the PS network is the
bottleneck as it communicates a large number of gradients and parameters with
the many workers. This is because synchronization with the many workers has to
occur at every step of training. Depending on the model, communication can
amount to several hundred megabytes per synchronization. In this paper, we propose a
scheme to reduce network traffic through layer placement that considers the
resources that each layer uses. Through analysis of the characteristics of CNN,
we find that placement of layers can be done in an effective manner. We then
incorporate this observation within the TensorFlow framework such that layers
can be automatically placed for more efficient training. Our evaluation of
this placement scheme shows that training time can be significantly
reduced without loss of accuracy for many CNN models.
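A hedged sketch of the resource-aware intuition (not the paper's actual TensorFlow mechanism): if the resource of interest is parameter traffic, a largest-first greedy pass can spread layers over parameter servers so no single PS carries most of the synchronization bytes. The layer names and sizes below are invented.

```python
def place_layers(layers, num_ps):
    """Largest-first greedy: put each layer's parameters on the parameter
    server currently carrying the least synchronization traffic."""
    traffic = [0.0] * num_ps
    placement = {}
    for name, param_bytes in sorted(layers.items(), key=lambda kv: -kv[1]):
        ps = min(range(num_ps), key=traffic.__getitem__)
        placement[name] = ps
        traffic[ps] += param_bytes
    return placement, traffic

# Sizes loosely shaped like a small CNN: convolutions are cheap,
# fully connected layers dominate the parameter traffic.
layers = {"conv1": 0.1e6, "conv2": 0.8e6, "fc1": 37e6, "fc2": 16e6}
print(place_layers(layers, num_ps=2))
```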
Building User-defined Runtime Adaptation Routines for Stream Processing Applications
Stream processing applications are deployed as continuous queries that run
from the time of their submission until their cancellation. This deployment
mode limits developers who need their applications to perform runtime
adaptation, such as algorithmic adjustments, incremental job deployment, and
application-specific failure recovery. Currently, developers do runtime
adaptation by using external scripts and/or by inserting operators into the
stream processing graph that are unrelated to the data processing logic. In
this paper, we describe a component called orchestrator that allows users to
write routines for automatically adapting the application to runtime
conditions. Developers build an orchestrator by registering and handling events
as well as specifying actuations. Events can be generated due to changes in the
system state (e.g., application component failures), built-in system metrics
(e.g., throughput of a connection), or custom application metrics (e.g.,
quality score). Once the orchestrator receives an event, users can take
adaptation actions by using the orchestrator actuation APIs. We demonstrate the
use of the orchestrator in IBM's System S in the context of three different
applications, illustrating application adaptation to changes in the incoming
data distribution, to application failures, and to on-demand dynamic composition.
Comment: VLDB 2012.
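The register-events/specify-actuations pattern can be mimicked with a small event bus. The class, method names, and event types below are our own stand-ins, not the actual System S orchestrator APIs.

```python
class Orchestrator:
    """Minimal event/actuation skeleton in the spirit of the paper."""

    def __init__(self):
        self.handlers = {}

    def on(self, event_type, handler):
        """Register a routine for an event type."""
        self.handlers.setdefault(event_type, []).append(handler)

    def emit(self, event_type, **payload):
        """Deliver a system, built-in, or custom application event."""
        for handler in self.handlers.get(event_type, []):
            handler(self, **payload)

    # Actuation stubs; a real orchestrator would call into the runtime.
    def restart_operator(self, name):
        print(f"restarting operator {name}")

    def swap_algorithm(self, operator, variant):
        print(f"switching {operator} to algorithm variant {variant}")


orch = Orchestrator()
# Application-specific failure recovery.
orch.on("operator_failed", lambda o, name: o.restart_operator(name))
# Algorithmic adjustment when a custom quality metric degrades.
orch.on("quality_score",
        lambda o, value: o.swap_algorithm("scorer", "robust") if value < 0.8 else None)

orch.emit("operator_failed", name="join-3")
orch.emit("quality_score", value=0.71)
```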
ADWISE: Adaptive Window-based Streaming Edge Partitioning for High-Speed Graph Processing
In recent years, the graph partitioning problem gained importance as a
mandatory preprocessing step for distributed graph processing on very large
graphs. Existing graph partitioning algorithms minimize partitioning latency by
assigning individual graph edges to partitions in a streaming manner, at the
cost of reduced partitioning quality. However, we argue that the mere
minimization of partitioning latency is not the optimal design choice in terms
of minimizing total graph analysis latency, i.e., the sum of partitioning and
processing latency. Instead, for complex and long-running graph processing
algorithms that run on very large graphs, it is beneficial to invest more time
into graph partitioning to reach a higher partitioning quality, which
drastically reduces graph processing latency. In this paper, we propose ADWISE,
a novel window-based streaming partitioning algorithm that increases the
partitioning quality by always choosing the best edge from a set of edges for
assignment to a partition. In doing so, ADWISE controls the partitioning
latency by adapting the window size dynamically at run-time. Our evaluations
show that ADWISE can reach the sweet spot between graph partitioning latency
and graph processing latency, reducing the total latency of partitioning plus
processing by up to 23-47 percent compared to the state of the art.
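The window idea is easy to sketch: buffer a few pending edges and always commit the best (edge, partition) pair under a locality-plus-balance score, rather than placing each edge the moment it arrives. The scoring function and the fixed window size below are our simplifications; ADWISE itself adapts the window size at run time.

```python
from collections import defaultdict

def window_partition(edges, k, window=16):
    """Buffer up to `window` unplaced edges and repeatedly commit the
    best-scoring (edge, partition) pair instead of strict arrival order."""
    mirrors = defaultdict(set)              # vertex -> partitions it already touches
    load = [0] * k
    assignment = {}
    buf, rest = list(edges[:window]), list(edges[window:])

    def score(edge, p):
        u, v = edge
        locality = (p in mirrors[u]) + (p in mirrors[v])
        balance = load[p] / (max(load) + 1)
        return locality - balance           # favor locality, discourage skew

    while buf:
        edge, p = max(((e, q) for e in buf for q in range(k)),
                      key=lambda ep: score(*ep))
        buf.remove(edge)
        assignment[edge] = p
        u, v = edge
        mirrors[u].add(p)
        mirrors[v].add(p)
        load[p] += 1
        if rest:
            buf.append(rest.pop(0))         # slide the window forward
    return assignment

# Two triangles: a good partitioner keeps each triangle together.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
print(window_partition(edges, k=2, window=3))
```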