
    HYPE: Massive Hypergraph Partitioning with Neighborhood Expansion

    Many important real-world applications, such as social networks or distributed databases, can be modeled as hypergraphs. In such a model, vertices represent entities, such as users or data records, whereas hyperedges model a group membership of the vertices, such as the authorship in a specific topic or the membership of a data record in a specific replicated shard. To optimize such applications, we need an efficient and effective solution to the NP-hard balanced k-way hypergraph partitioning problem. However, existing hypergraph partitioners that scale to very large graphs do not effectively exploit the hypergraph structure when making partitioning decisions. We propose HYPE, a hypergraph partitioner that exploits the neighborhood relations between vertices in the hypergraph using an efficient implementation of neighborhood expansion. HYPE improves partitioning quality by up to 95% and reduces runtime by up to 39% compared to streaming partitioning. Comment: To appear in Proceedings of IEEE 2018 International Conference on Big Data (BigData '18), 10 pages
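
    As a rough illustration of the neighborhood-expansion idea, the sketch below greedily grows one partition by always pulling in the fringe vertex that adds the fewest new external neighbors. All names and the scoring rule are simplifications for illustration, not HYPE's actual implementation.

    ```python
    # Illustrative neighborhood-expansion step: grow one partition from a
    # seed, greedily taking the fringe vertex with the fewest external
    # neighbors. Simplified sketch, not HYPE's implementation.

    def neighbors(hyperedges, v):
        """Vertices sharing at least one hyperedge with v."""
        return {u for e in hyperedges if v in e for u in e} - {v}

    def expand_partition(hyperedges, seed, capacity):
        part, fringe = {seed}, neighbors(hyperedges, seed)
        while len(part) < capacity and fringe:
            best = min(fringe, key=lambda v: len(neighbors(hyperedges, v) - part))
            part.add(best)
            fringe = (fringe | neighbors(hyperedges, best)) - part
        return part

    # toy usage: hyperedges given as vertex sets
    print(expand_partition([{1, 2, 3}, {3, 4}, {4, 5, 6}], 1, 3))  # {1, 2, 3}
    ```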

    Graph Partitioning via Parallel Submodular Approximation to Accelerate Distributed Machine Learning

    Distributed computing excels at processing large-scale data, but the communication cost of synchronizing the shared parameters may slow down overall performance. Fortunately, the interactions between parameters and data in many problems are sparse, which admits an efficient partitioning that reduces the communication overhead. In this paper, we formulate data placement as a graph partitioning problem. We propose a distributed partitioning algorithm with theoretical guarantees and a highly efficient implementation, and demonstrate its promising results on both text datasets and social networks. We show that the proposed algorithm leads to a 1.6x speedup of a state-of-the-art distributed machine learning system by eliminating 90% of the network communication.
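
    A hedged sketch of the data placement formulation: each data block touches a sparse set of parameters, and blocks are placed so that few parameters must be synchronized across machines. The greedy rule below is a simple stand-in for the paper's parallel submodular approximation; all names and numbers are illustrative.

    ```python
    # Greedy stand-in for data placement: assign each data block to the
    # machine that would newly replicate the fewest of its parameters,
    # with a small load penalty. Illustrative only.

    def place(blocks, num_machines):
        """blocks: one set of parameter ids per data block."""
        owned = [set() for _ in range(num_machines)]  # parameters per machine
        load = [0] * num_machines
        placement = {}
        for b, params in enumerate(blocks):
            def cost(m):
                return len(params - owned[m]) + 0.5 * load[m]
            m = min(range(num_machines), key=cost)
            owned[m] |= params
            load[m] += 1
            placement[b] = m
        return placement

    print(place([{0, 1}, {1, 2}, {2, 3}, {0, 3}], 2))
    ```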

    Physics-Aware Neural Networks for Distribution System State Estimation

    The distribution system state estimation problem seeks to determine the network state from available measurements. Widely used Gauss-Newton approaches are very sensitive to the initialization and often not suitable for real-time estimation. Learning approaches are very promising for real-time estimation, as they shift the computational burden to an offline training stage. Prior machine learning approaches to power system state estimation have been electrical model-agnostic, in that they did not exploit the topology and physical laws governing the power grid to design the architecture of the learning model. In this paper, we propose a novel learning model that utilizes the structure of the power grid. The proposed neural network architecture reduces the number of coefficients needed to parameterize the mapping from the measurements to the network state by exploiting the separability of the estimation problem. This prevents overfitting and reduces the complexity of the training stage. We also propose a greedy algorithm for phasor measurement unit (PMU) placement that aims at minimizing the complexity of the neural network required for realizing the state estimation mapping. Simulation results show superior performance of the proposed method over the Gauss-Newton approach. Comment: 8 pages, 5 figures, 3 tables
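
    One way to picture "reducing coefficients by exploiting grid structure" is a layer whose weight matrix is masked by the bus adjacency matrix, so each output depends only on electrically adjacent measurements. This is an assumption-laden toy, not the paper's architecture.

    ```python
    # Toy "physics-aware" layer: mask the weights with the grid's bus
    # adjacency so non-neighboring buses share no coefficients.
    # Hypothetical 3-bus example, not the paper's architecture.
    import numpy as np

    rng = np.random.default_rng(0)
    adjacency = np.array([[1, 1, 0],
                          [1, 1, 1],
                          [0, 1, 1]])        # 3-bus toy grid, self-loops included
    W = rng.normal(size=(3, 3)) * adjacency  # zero out non-neighbor couplings

    def layer(z):                            # z: one measurement per bus
        return np.tanh(W @ z)                # sparse, topology-restricted mapping

    print(layer(np.array([1.0, 0.5, -0.2])))
    ```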

    Social Hash Partitioner: A Scalable Distributed Hypergraph Partitioner

    We design and implement a distributed algorithm for balanced k-way hypergraph partitioning that minimizes fanout, a fundamental hypergraph quantity also known as the communication volume and (k−1)-cut metric, by optimizing a novel objective called probabilistic fanout. This choice allows a simple local search heuristic to achieve solution quality comparable to the best existing hypergraph partitioners. Our algorithm is arbitrarily scalable due to a careful design that controls computational complexity, space complexity, and communication. In practice, we commonly process hypergraphs with billions of vertices and hyperedges in a few hours. We explain how the algorithm's scalability, both in terms of hypergraph size and bucket count, is limited only by the number of machines available. We perform an extensive comparison to existing distributed hypergraph partitioners and find that our approach is able to optimize hypergraphs roughly 100 times bigger on the same set of machines. We call the resulting tool Social Hash Partitioner (SHP), and accompanying this paper, we open-source the most scalable version based on recursive bisection. Comment: Proceedings of the VLDB Endowment, 2017
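
    To make the fanout objective concrete: the fanout of a hyperedge is the number of distinct buckets its vertices occupy, and a probabilistic variant smooths this into an expectation. The smoothing below is a plausible reading for illustration, not SHP's exact objective.

    ```python
    # Fanout of a hyperedge, plus one plausible probabilistic smoothing:
    # the expected number of buckets the edge touches if each vertex
    # lands in bucket b with probability bucket_probs[v][b]. Simplified,
    # not SHP's exact objective.

    def fanout(hyperedge, assignment):
        return len({assignment[v] for v in hyperedge})

    def prob_fanout(hyperedge, bucket_probs, k):
        expected = 0.0
        for b in range(k):
            p_untouched = 1.0
            for v in hyperedge:
                p_untouched *= 1.0 - bucket_probs[v][b]
            expected += 1.0 - p_untouched    # P(bucket b is touched)
        return expected

    print(fanout({1, 2, 3}, {1: 0, 2: 0, 3: 1}))                   # 2
    print(prob_fanout({1, 2}, {1: [0.9, 0.1], 2: [0.5, 0.5]}, 2))  # 1.5
    ```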

    DRONE: a Distributed Subgraph-Centric Framework for Processing Large Scale Power-law Graphs

    Nowadays, in the big data era, social networks, graph databases, knowledge graphs, electronic commerce, etc. demand efficient and scalable capabilities to process an ever-increasing volume of graph-structured data. To meet the challenge, two mainstream distributed programming models, vertex-centric (VC) and subgraph-centric (SC), were proposed. Compared to the VC model, the SC model converges faster with less communication overhead on well-partitioned graphs, and is easier to program due to its "think like a graph" philosophy. The edge-cut method is considered a natural choice for graph partitioning in the subgraph-centric model, and has been adopted by Giraph++, Blogel, and GRAPE. However, the edge-cut method causes a significant performance bottleneck when processing large-scale power-law graphs. Thus, the SC model is less competitive in practice. In this paper, we present an innovative distributed graph computing framework, DRONE (Distributed gRaph cOmputiNg Engine), which combines the subgraph-centric model with the vertex-cut graph partitioning strategy. Experiments show that DRONE outperforms state-of-the-art distributed graph computing engines on real-world graphs and synthetic power-law graphs. DRONE is capable of scaling up to process one-trillion-edge synthetic power-law graphs, which is orders of magnitude larger than previously reported by existing SC-based frameworks. Comment: 13 pages, 9 figures
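
    For readers unfamiliar with vertex-cut partitioning (where edges are assigned to parts and high-degree vertices may be replicated), a common greedy heuristic looks roughly like the following sketch; it is illustrative, not DRONE's partitioner.

    ```python
    # Greedy vertex-cut sketch: assign each edge to a partition, letting
    # vertices be replicated across partitions. Illustrative heuristic,
    # not DRONE's partitioner.

    def vertex_cut(edges, k):
        replicas = {}                        # vertex -> partitions holding a copy
        load = [0] * k
        assignment = []
        for u, v in edges:
            pu = replicas.get(u, set())
            pv = replicas.get(v, set())
            # prefer partitions already holding both endpoints, then either
            cand = (pu & pv) or (pu | pv) or set(range(k))
            p = min(cand, key=lambda x: load[x])
            assignment.append(((u, v), p))
            replicas.setdefault(u, set()).add(p)
            replicas.setdefault(v, set()).add(p)
            load[p] += 1
        return assignment

    print(vertex_cut([(1, 2), (2, 3), (1, 3), (3, 4)], 2))
    ```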

    Device Placement Optimization with Reinforcement Learning

    The past few years have witnessed a growth in the size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods. Comment: To appear at ICML 2017
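
    The shape of the training loop can be sketched as follows: a policy proposes a placement, the measured step time acts as a (negative) reward, and the best placement is tracked. The stubs below stand in for the seq2seq policy and the real TensorFlow runtime; none of this is the authors' code.

    ```python
    # Caricature of the RL loop: sample placements, measure runtime,
    # keep the fastest. The policy and runtime are hypothetical stubs.
    import random

    DEVICES = ["cpu:0", "gpu:0", "gpu:1"]

    def sample_placement(ops):               # stand-in for the seq2seq policy
        return {op: random.choice(DEVICES) for op in ops}

    def measure_runtime(placement):          # stand-in for executing the graph
        return sum(0.5 if d == "cpu:0" else 0.1 for d in placement.values())

    ops = ["conv1", "conv2", "lstm", "softmax"]
    best, best_time = None, float("inf")
    for _ in range(100):
        placement = sample_placement(ops)
        t = measure_runtime(placement)       # reward = -t would update the policy
        if t < best_time:
            best, best_time = placement, t
    print(best_time, best)
    ```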

    Search-based Tier Assignment for Optimising Offline Availability in Multi-tier Web Applications

    Web programmers are often faced with several challenges in the development process of modern, rich internet applications. Technologies for the different tiers of the application have to be selected: a server-side language, a combination of JavaScript, HTML and CSS for the client, and a database technology. Meeting the expectations of contemporary web applications requires even more effort from the developer: many state-of-the-art libraries must be mastered and glued together. This leads to an impedance mismatch problem between the different technologies, and it is up to the programmer to align them manually. Multi-tier or tierless programming is a web programming paradigm that provides one language for the different tiers of the web application, allowing the programmer to focus on the actual program logic instead of the accidental complexity that comes from combining several technologies. While current tierless approaches thus relieve the burden of having to combine different technologies into one application, the distribution of the code remains explicitly tied into the program. Certain distribution decisions have an impact on crosscutting concerns such as information security or offline availability. Moreover, adapting the program so that the application complies better with these concerns often leads to code tangling, rendering the program more difficult to understand and maintain. We introduce an approach to multi-tier programming where the tierless code is decoupled from the tier specification. The developer implements the web application in terms of slices and an external specification that assigns the slices to tiers. A recommender system completes the picture for those slices that do not have a fixed placement and proposes slice refinements as well. This recommender system tries to optimise the tier specification with respect to one or more crosscutting concerns. This is in contrast with current cutting-edge solutions that hide distribution decisions from the programmer. In this paper we show that slices, together with a recommender system, enable the developer to experiment with different placements of slices until the distribution of the code satisfies the programmer's needs. We present a search-based recommender system that maximises the offline availability of a web application and a concrete implementation of these concepts in the tier-splitting tool Stip.js.
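
    A toy version of the search-based recommender: enumerate assignments of unpinned slices to tiers and keep the one with the best offline-availability score. The slice names and scoring function are invented for illustration and are not Stip.js internals.

    ```python
    # Toy search-based tier recommender: exhaustively score assignments
    # of unpinned slices to tiers. Slice names and metric are invented.
    import itertools

    slices = ["ui", "validate", "cache", "orders"]
    fixed = {"ui": "client", "orders": "server"}   # developer-pinned slices
    free = [s for s in slices if s not in fixed]

    def offline_score(assign):
        # crude proxy: more client-side slices means better offline availability
        return sum(1 for s in slices if assign[s] == "client")

    best = None
    for choice in itertools.product(["client", "server"], repeat=len(free)):
        assign = {**fixed, **dict(zip(free, choice))}
        if best is None or offline_score(assign) > offline_score(best):
            best = assign
    print(best)
    ```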

    Accelerated Training for CNN Distributed Deep Learning through Automatic Resource-Aware Layer Placement

    The Convolutional Neural Network (CNN) model, often used for image classification, requires significant training time to obtain high accuracy. To this end, distributed training is performed with the parameter server (PS) architecture using multiple servers. Unfortunately, scalability has been found to be poor in existing architectures. We find that the PS network is the bottleneck, as it communicates a large number of gradients and parameters with the many workers. This is because synchronization with the many workers has to occur at every step of training. Depending on the model, communication can amount to several hundred MB per synchronization. In this paper, we propose a scheme to reduce network traffic through layer placement that considers the resources that each layer uses. Through analysis of the characteristics of CNNs, we find that placement of layers can be done in an effective manner. We then incorporate this observation within the TensorFlow framework such that layers can be automatically placed for more efficient training. Our evaluation of this placement scheme shows that training time can be significantly reduced without loss of accuracy for many CNN models.
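
    The core placement decision can be caricatured with a toy cost model: parameter-heavy layers (typically the fully connected ones) are kept with the workers so their large gradient transfers stay off the PS link. The layer sizes and threshold below are made-up illustrative numbers.

    ```python
    # Toy resource-aware placement: keep parameter-heavy layers off the
    # PS link. Layer sizes and the threshold are illustrative only.

    layers = [("conv1", 0.1), ("conv2", 0.3), ("fc1", 400.0), ("fc2", 160.0)]
    # (name, parameter size in MB: made-up numbers)

    def place_layers(layers, threshold_mb=100.0):
        placement = {}
        for name, size in layers:
            # heavy transfers stay worker-local instead of crossing the PS link
            placement[name] = "worker" if size >= threshold_mb else "ps"
        return placement

    print(place_layers(layers))
    ```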

    Building User-defined Runtime Adaptation Routines for Stream Processing Applications

    Stream processing applications are deployed as continuous queries that run from the time of their submission until their cancellation. This deployment mode limits developers who need their applications to perform runtime adaptation, such as algorithmic adjustments, incremental job deployment, and application-specific failure recovery. Currently, developers do runtime adaptation by using external scripts and/or by inserting operators into the stream processing graph that are unrelated to the data processing logic. In this paper, we describe a component called the orchestrator that allows users to write routines for automatically adapting the application to runtime conditions. Developers build an orchestrator by registering and handling events as well as specifying actuations. Events can be generated due to changes in the system state (e.g., application component failures), built-in system metrics (e.g., throughput of a connection), or custom application metrics (e.g., quality score). Once the orchestrator receives an event, users can take adaptation actions by using the orchestrator actuation APIs. We demonstrate the use of the orchestrator in IBM's System S in the context of three different applications, illustrating application adaptation to changes in the incoming data distribution, to application failures, and to on-demand dynamic composition. Comment: VLDB 2012
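
    The register-handle-actuate pattern described above might look like the following minimal sketch; the class and method names are hypothetical, not the System S orchestrator API.

    ```python
    # Minimal register/handle/actuate sketch; hypothetical API names.
    class Orchestrator:
        def __init__(self):
            self.handlers = {}

        def on(self, event, handler):        # register a handler for an event
            self.handlers.setdefault(event, []).append(handler)

        def emit(self, event, **info):       # a metric or failure fires an event
            for h in self.handlers.get(event, []):
                h(self, **info)

        def restart_operator(self, name):    # example actuation
            print(f"restarting operator {name}")

    orc = Orchestrator()
    orc.on("operator_failed", lambda o, name: o.restart_operator(name))
    orc.emit("operator_failed", name="scorer")
    ```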

    ADWISE: Adaptive Window-based Streaming Edge Partitioning for High-Speed Graph Processing

    In recent years, the graph partitioning problem gained importance as a mandatory preprocessing step for distributed graph processing on very large graphs. Existing graph partitioning algorithms minimize partitioning latency by assigning individual graph edges to partitions in a streaming manner, at the cost of reduced partitioning quality. However, we argue that the mere minimization of partitioning latency is not the optimal design choice in terms of minimizing total graph analysis latency, i.e., the sum of partitioning and processing latency. Instead, for complex and long-running graph processing algorithms that run on very large graphs, it is beneficial to invest more time into graph partitioning to reach a higher partitioning quality, which drastically reduces graph processing latency. In this paper, we propose ADWISE, a novel window-based streaming partitioning algorithm that increases partitioning quality by always choosing the best edge from a set of edges for assignment to a partition. In doing so, ADWISE controls the partitioning latency by adapting the window size dynamically at run-time. Our evaluations show that ADWISE can reach the sweet spot between graph partitioning latency and graph processing latency, reducing the total latency of partitioning plus processing by 23 to 47 percent compared to the state of the art.
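
    A minimal sketch of window-based streaming edge partitioning: instead of assigning each arriving edge immediately, keep a window of candidate edges and always assign the best-scoring one. The locality-plus-balance score and fixed window size are simplified stand-ins for ADWISE's adaptive rules.

    ```python
    # Sketch of window-based streaming edge partitioning; scoring and
    # the fixed window size are simplified stand-ins for ADWISE.

    def partition_stream(edges, k, window=3):
        parts = [set() for _ in range(k)]    # vertices seen per partition
        load = [0] * k
        buf, out = [], []

        def assign_best():
            # choose the (edge, partition) pair with the best locality,
            # lightly penalized by partition load
            best_e, best_p = max(((e, p) for e in buf for p in range(k)),
                                 key=lambda ep: len(set(ep[0]) & parts[ep[1]])
                                                - 0.01 * load[ep[1]])
            buf.remove(best_e)
            parts[best_p].update(best_e)
            load[best_p] += 1
            out.append((best_e, best_p))

        for e in edges:
            buf.append(e)
            if len(buf) >= window:           # window full: place the best edge
                assign_best()
        while buf:                           # drain the final window
            assign_best()
        return out

    print(partition_stream([(1, 2), (2, 3), (3, 4), (1, 3), (4, 5)], 2))
    ```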