Load balancing fictions, falsehoods and fallacies
Abstract Effective use of a parallel computer requires that a calculation be carefully divided among the processors. This load balancing problem appears in many guises and has been a fervent area of research for the past decade or more. Although great progress has been made, and useful software tools developed, a number of challenges remain. It is the conviction of the author that these challenges will be easier to address if we first come to terms with some significant shortcomings in our current perspectives. This paper tries to identify several areas in which the prevailing point of view is either mistaken or insufficient. The goal is to motivate new ideas and directions for this important field.
Load Duration and Probability Based Design of Wood Structural Members
Methods are presented for calculating limit state probabilities of engineered wood structural members, considering load duration effects due to stochastic dead and snow load. These methods are used to conduct reliability studies of existing wood design criteria. When realistic load processes are considered, it is found that the importance of load duration and gradual damage accumulation has been somewhat overstated. One possible probability-based design method that should be useful in future code development work is also presented.
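As a rough, hedged illustration of the kind of limit state probability such studies estimate, the Monte Carlo sketch below checks a single-year limit state R < D + S; the distributions and parameters are assumptions made purely for illustration, and the duration-of-load damage accumulation that the paper actually analyzes is deliberately omitted to keep the sketch short.

```python
import numpy as np

# Minimal Monte Carlo sketch of a limit state probability estimate.
# All distributions and parameters are illustrative assumptions, not
# values from the paper; duration-of-load damage accumulation is omitted.
rng = np.random.default_rng(0)
n = 200_000

dead = rng.normal(1.05, 0.10, n)                   # dead load (assumed Normal)
snow = rng.gumbel(0.8, 0.25, n)                    # annual-max snow load (assumed Gumbel)
resistance = rng.lognormal(np.log(3.0), 0.15, n)   # member resistance (assumed lognormal)

# Limit state g = R - (D + S); failure when g < 0.
p_fail = np.mean(resistance < dead + snow)
print(f"Estimated annual limit state probability: {p_fail:.2e}")
```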
Exploiting flexibly assignable work to improve load balance
In many applications of parallel computing, distribution of the data unambiguously implies distribution of work among processors. But there are exceptions where some tasks can be assigned to one of several processors without altering the total volume of communication. In this paper, we study the problem of exploiting this flexibility in the assignment of tasks to improve load balance. We first model the problem in terms of network flow and use combinatorial techniques for its solution. Our parametric search algorithms use maximum flow algorithms for probing on a candidate optimal solution value. We describe two algorithms to solve the assignment problem with log W_T and |P| probe calls, where W_T and |P|, respectively, denote the total workload and the number of processors. We also define augmenting paths and cuts for this problem, and show that any algorithm based on augmenting paths can be used to find an optimal solution for the task assignment problem. We then consider a continuous version of the problem, and formulate it as a linearly constrained optimization problem, i.e., min ‖Ax‖_∞ subject to Bx = d. To avoid solving an intractable ∞-norm optimization problem, we show that in this case minimizing the 2-norm is sufficient to minimize the ∞-norm, which reduces the problem to the well-studied linearly constrained least squares problem. The continuous version of the problem has the advantage of being easily amenable to parallelization.
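As a rough sketch of the final reduction described above, the snippet below solves the equality-constrained least-squares problem min ‖Ax‖₂ subject to Bx = d through its KKT system; the matrices are random placeholders, and nothing here is drawn from the paper's implementation, which rests on the stated result that the 2-norm minimizer also minimizes the ∞-norm for this problem class.

```python
import numpy as np

# Sketch: minimize ||A x||_2 subject to B x = d via the KKT system
#   [ 2 A^T A   B^T ] [ x      ]   [ 0 ]
#   [ B         0   ] [ lambda ] = [ d ]
# A, B, d are small random placeholders, not data from the paper.
rng = np.random.default_rng(1)
A = rng.random((6, 4))   # maps assignments to per-processor workloads (placeholder)
B = rng.random((2, 4))   # linear constraints on the assignment (placeholder)
d = rng.random(2)

n, m = A.shape[1], B.shape[0]
K = np.block([[2 * A.T @ A, B.T],
              [B, np.zeros((m, m))]])
rhs = np.concatenate([np.zeros(n), d])
x = np.linalg.solve(K, rhs)[:n]

print("constraint residual ||Bx - d||:", np.linalg.norm(B @ x - d))
print("objective ||Ax||_2            :", np.linalg.norm(A @ x))
```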
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L.
Abstract Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular, those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional 1D (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and 1D partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported.
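To make the frontier-expansion structure concrete, here is a minimal serial, level-synchronous BFS sketch; it illustrates the pattern that a distributed, 2D-partitioned implementation parallelizes (with the per-level frontier exchange replaced by collective communication), and is not the BlueGene/L code itself. The toy graph is an assumption for illustration.

```python
def bfs_levels(adj, source):
    """Serial level-synchronous BFS: expand the whole frontier each step.

    `adj` maps a vertex to an iterable of its neighbors. A distributed
    version partitions vertices/edges across processors and exchanges
    the next frontier with collective communication at every level.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

# Toy usage (illustrative graph, not from the paper).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}
```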
Tolerating the Community Detection Resolution Limit with Edge Weighting
Communities of vertices within a giant network such as the World-Wide Web are
likely to be vastly smaller than the network itself. However, Fortunato and
Barthélemy have proved that modularity maximization algorithms for
community detection may fail to resolve communities with fewer than √(L/2)
edges, where L is the number of edges in the entire network.
This resolution limit leads modularity maximization algorithms to have
notoriously poor accuracy on many real networks. Fortunato and Barthélemy's
argument can be extended to networks with weighted edges as well, and we derive
this corollary argument. We conclude that weighted modularity algorithms may
fail to resolve communities with less than √(Wε/2) total edge
weight, where W is the total edge weight in the network and ε is the
maximum weight of an inter-community edge. If ε is small, then small
communities can be resolved.
Given a weighted or unweighted network, we describe how to derive new edge
weights in order to achieve a low ε; we modify the "CNM" community
detection algorithm to maximize weighted modularity; and we show that the
resulting algorithm has greatly improved accuracy. In experiments with an
emerging community standard benchmark, we find that our simple CNM variant is
competitive with the most accurate community detection methods yet proposed.
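As a small, hedged illustration of the objective these weighted algorithms maximize, the sketch below computes the weighted modularity of a fixed partition from an edge list; the toy graph is an assumption for illustration, and this is not the paper's modified CNM algorithm or its edge-weighting scheme.

```python
def weighted_modularity(edges, community):
    """Weighted modularity Q of a fixed partition.

    `edges`: iterable of (u, v, w) for an undirected weighted graph.
    `community`: dict mapping each vertex to its community label.
    Q = intra/W - sum_c (K_c / 2W)^2, where W is the total edge weight,
    intra is the weight of intra-community edges, and K_c is the total
    strength (weighted degree) of community c.
    """
    W = sum(w for _, _, w in edges)
    strength = {}
    intra = 0.0
    for u, v, w in edges:
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
        if community[u] == community[v]:
            intra += w
    K = {}
    for node, k in strength.items():
        c = community[node]
        K[c] = K.get(c, 0.0) + k
    return intra / W - sum((kc / (2 * W)) ** 2 for kc in K.values())

# Toy example: two triangles joined by one light inter-community edge,
# so the maximum inter-community edge weight (the ε above) is small.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1),
         (3, 4, 1), (4, 5, 1), (3, 5, 1),
         (2, 3, 0.1)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(weighted_modularity(edges, community), 4))   # ≈ 0.4836
```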
Parallel Shortest Path Algorithms for Solving Large-Scale Instances
We present an experimental study of parallel algorithms for solving the single source
shortest path problem with non-negative edge weights (NSSP) on large-scale graphs.
We implement Meyer and Sanders' Δ-stepping algorithm and report performance results on the Cray MTA-2, a multithreaded parallel architecture. The MTA-2 is a
high-end shared memory system offering two unique features that aid the efficient implementation of irregular parallel graph algorithms: the ability to exploit fine-grained
parallelism, and low-overhead synchronization primitives. Our implementation exhibits
remarkable parallel speedup when compared with a competitive sequential algorithm,
for low-diameter sparse graphs. For instance, Δ-stepping on a directed scale-free graph
of 100 million vertices and 1 billion edges takes less than ten seconds on 40 processors
of the MTA-2, with a relative speedup of close to 30. To our knowledge, these are the
first performance results of a parallel NSSP problem on realistic graph instances in the
order of billions of vertices and edges.
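For readers who want the shape of the algorithm being parallelized, here is a minimal sequential sketch of Δ-stepping's bucket structure with light/heavy edge separation; the graph, the Δ value, and the data structures are illustrative assumptions and say nothing about the multithreaded MTA-2 implementation studied in the paper.

```python
import math
from collections import defaultdict

def delta_stepping(adj, source, delta):
    """Sequential sketch of Delta-stepping.

    `adj` maps u -> list of (v, weight) with non-negative weights.
    Tentative distances are kept in buckets of width `delta`; light
    edges (weight <= delta) are relaxed repeatedly inside a bucket,
    heavy edges once per settled vertex. Parallel implementations
    process each bucket's relaxation requests concurrently.
    """
    dist = defaultdict(lambda: math.inf)
    buckets = defaultdict(set)

    def relax(v, d):
        if d < dist[v]:
            if dist[v] < math.inf:
                buckets[int(dist[v] // delta)].discard(v)
            dist[v] = d
            buckets[int(d // delta)].add(v)

    relax(source, 0.0)
    while buckets:
        i = min(buckets)                    # smallest non-empty bucket index
        settled = set()
        while buckets.get(i):               # light-edge phases for bucket i
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in adj[u]:
                    if w <= delta:
                        relax(v, dist[u] + w)
        for u in settled:                   # heavy edges relaxed once
            for v, w in adj[u]:
                if w > delta:
                    relax(v, dist[u] + w)
        buckets = defaultdict(set, {k: s for k, s in buckets.items() if s})
    return dict(dist)

# Toy usage (illustrative graph).
adj = {0: [(1, 0.5), (2, 3.0)], 1: [(2, 0.5)], 2: []}
print(delta_stepping(adj, 0, delta=1.0))    # {0: 0.0, 1: 0.5, 2: 1.0}
```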