A simple parallel prefix algorithm for compact finite-difference schemes

Authors: Joslin Ronald D.; Sun Xian-He

A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the systems resulting from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is highly efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study was conducted to provide a simple truncation formula. Experimental results were measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. Experimental results show that the simple parallel prefix algorithm is a good algorithm for the compact scheme on high-performance computers.

NASA Technical Reports Server

Optimistic barrier synchronization

Author: Nicol David M.

Barrier synchronization is a fundamental operation in parallel computation. In many contexts, at the point a processor enters a barrier it knows that it has already processed all the work required of it prior to synchronization. The alternative case, when a processor cannot enter a barrier with the assurance that it has already performed all the necessary pre-synchronization computation, is treated. The problem arises when the number of pre-synchronization messages to be received by a processor is unknown, for example, in a parallel discrete simulation or any other computation that is largely driven by an unpredictable exchange of messages. We describe an optimistic O(log² P) barrier algorithm for such problems, study its performance on a large-scale parallel system, and consider extensions to general associative reductions as well as associative parallel prefix computations.

NASA Technical Reports Server

Functional and dynamic programming in the design of parallel prefix networks

Author: Sheeran Mary
Publication date: 01/01/2010

A parallel prefix network of width n takes n inputs, a1, a2, ..., an, and computes each yi = a1 ○ a2 ○ ... ○ ai for 1 ≤ i ≤ n, for an associative operator ○. This is one of the fundamental problems in computer science, because it gives insight into how parallel computation can be used to solve an apparently sequential problem. As parallel programming becomes the dominant programming paradigm, parallel prefix or scan is proving to be a very important building block of parallel algorithms and applications. There are many different parallel prefix networks, with different properties such as number of operators, depth and allowed fanout from the operators. In this paper, ideas from functional programming are combined with search to enable a deep exploration of parallel prefix network design. Networks that improve on the best known previous results are generated. It is argued that precise modelling in a functional programming language, together with simple visualization of the networks, gives a new, more experimental, approach to parallel prefix network design, improving on the manual techniques typically employed in the literature. The programming idiom that marries search with higher order functions may well have wider application than the network generation described here.
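
As a minimal sketch of the definition above, the following Python compares a serial scan with a log-depth scan laid out in the style of the well-known Kogge-Stone network. The function names are illustrative assumptions and are not taken from the paper, which explores many other network shapes.

```python
from operator import add


def serial_prefix(xs, op):
    """Reference scan: y[i] = x[0] op x[1] op ... op x[i], left to right."""
    ys = []
    for x in xs:
        ys.append(x if not ys else op(ys[-1], x))
    return ys


def log_depth_prefix(xs, op):
    """Kogge-Stone-style scan; returns (prefixes, number of operator levels)."""
    ys = list(xs)
    depth = 0
    dist = 1
    while dist < len(ys):
        # Each position i >= dist combines with the value dist places to its
        # left. All combinations within one level are mutually independent,
        # so a network could evaluate them at the same time.
        ys = [ys[i] if i < dist else op(ys[i - dist], ys[i])
              for i in range(len(ys))]
        dist *= 2
        depth += 1
    return ys, depth
```

The left operand is always the earlier span, so the sketch only relies on associativity and also works for non-commutative operators such as string concatenation. The shallow depth is paid for with more operator applications than the serial scan uses.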

Chalmers Research

Chalmers Publication Library

Pipelining Saturated Accumulation

Authors: Chan Stephanie; DeHon André; Kapre Nachiket; Papadantonakis Karl
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 02/04/2008

Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3 (XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM.
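
One common way to make saturated addition associative, sketched here under the assumption that each input x is treated as the function y -> clamp(y + x, lo, hi), is to represent such functions as (shift, lo, hi) triples and compose them. The triple encoding and all names below are illustrative; the paper's hardware formulation may differ.

```python
def clamp(v, lo, hi):
    """Saturate v to the closed interval [lo, hi]."""
    return max(lo, min(hi, v))


def compose(f, g):
    """Triple for 'apply f, then g'.

    A triple (s, lo, hi) stands for the function y -> clamp(y + s, lo, hi).
    Function composition is associative, so this operator can drive a scan.
    """
    s1, lo1, hi1 = f
    s2, lo2, hi2 = g
    return (s1 + s2, clamp(lo1 + s2, lo2, hi2), clamp(hi1 + s2, lo2, hi2))


def saturated_accumulate_serial(xs, lo, hi, start=0):
    """Reference loop with the cyclic dependency on the accumulator."""
    acc, out = start, []
    for x in xs:
        acc = clamp(acc + x, lo, hi)
        out.append(acc)
    return out


def saturated_accumulate_prefix(xs, lo, hi, start=0):
    """Same results via a log-depth scan over composed triples."""
    fs = [(x, lo, hi) for x in xs]
    dist = 1
    while dist < len(fs):
        # Earlier span on the left, later span on the right.
        fs = [fs[i] if i < dist else compose(fs[i - dist], fs[i])
              for i in range(len(fs))]
        dist *= 2
    # Apply each composed function to the initial accumulator value.
    return [clamp(start + s, l, h) for s, l, h in fs]
```

For instance, with the range [-8, 7] both functions map the inputs 5, 5, -3, -20, 4 to the same saturated running totals.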

CiteSeerX

Caltech Authors

Work-stealing prefix scan: Addressing load imbalance in large-scale image registration

Authors: Berkels Benjamin; Bientinesi Paolo; Copik Marcin; Grosser Tobias; Hoefler Torsten
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2022

Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this article, we study the recursive registration of a series of electron microscopy images - a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at nanoscale for long microscopy image series.

Repository for Publications and Research Data

Edinburgh Research Explorer

Publikationsserver der RWTH Aachen University

Optimistic Parallelization of Floating-Point Accumulation

Authors: DeHon André; Kapre Nachiket
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2007

Floating-point arithmetic is notoriously non-associative due to the limited-precision representation, which demands that intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks.
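
The "mostly associative" observation can be illustrated with a small Python sketch. It only contrasts a serial sum with a tree-shaped (reassociated) sum in IEEE 754 double precision; it does not implement the paper's error-propagation scheme, and the function names are illustrative assumptions.

```python
def serial_sum(xs):
    """Left-to-right accumulation, the order a sequential loop uses."""
    total = 0.0
    for x in xs:
        total += x
    return total


def tree_sum(xs):
    """Reassociated sum: split in half, sum each half, add the two results."""
    if not xs:
        return 0.0
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return tree_sum(xs[:mid]) + tree_sum(xs[mid:])


# Regrouping three doubles changes the rounded result by one unit in the
# last place, so the two groupings compare unequal.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
```

The two groupings differ only in the final bits, which is the sense in which a reassociated sum is a good optimistic guess that may still need correction.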

CiteSeerX

Crossref

Caltech Authors

ScholarlyCommons@Penn

An Efficient Parallel IP Lookup Technique for IPv6 Routers Using Multiple Hashing with Ternary marker storage

Author: Pokkuluri Kiran Sree
Publication venue: University of Technology, Sydney (UTS)
Publication date: 19/10/2011

Internet address lookup is a challenging problem because of the increasing routing table sizes, increased traffic, higher speed links, and the migration to 128-bit IPv6 addresses. Routing lookup involves computation of the best matching prefix, for which existing solutions scale poorly when traffic in the router increases or when employed for IPv6 address lookup. Our paper describes a novel approach which employs multiple hashing on a reduced number of hash tables on which ternary search on levels is applied in parallel. This scheme handles the large number of prefixes generated by controlled prefix expansion by reducing collisions and distributing load fairly in the hash buckets, thus providing faster worst-case and average-case lookups. The approach we describe is fast, simple, scalable, parallelizable, and flexible.
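
For orientation, best-matching-prefix lookup can be sketched with a simple baseline: one hash table per prefix length, probed linearly from the longest length to the shortest. This is not the paper's scheme, which searches the levels with a parallel ternary search and uses multiple hashing; the bit-string encoding and function names are assumptions made for illustration.

```python
def build_tables(routes):
    """Group routes by prefix length: {length: {prefix_bits: next_hop}}.

    routes maps a prefix written as a string of '0'/'1' characters to a
    next-hop label.
    """
    tables = {}
    for prefix, hop in routes.items():
        tables.setdefault(len(prefix), {})[prefix] = hop
    return tables


def longest_prefix_match(tables, addr_bits):
    """Return the next hop of the longest stored prefix of addr_bits, or None."""
    for length in sorted(tables, reverse=True):
        if length > len(addr_bits):
            continue
        hop = tables[length].get(addr_bits[:length])
        if hop is not None:
            return hop
    return None
```

The linear probe costs one hash lookup per distinct prefix length in the worst case, which is the cost that searching over levels is meant to reduce.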

Crossref

UTS ePress

Crosslinking in parallel

Author: Asuri Hari S.
Publication venue: eScholarship, University of California
Publication date: 01/01/1992

A crosslink is a double link established between the two entries of an edge in an adjacency list representation of a graph. Crosslinks play important roles in several parallel algorithms as they provide constant time access between the two entries of an edge; the existence of crosslinks is usually assumed. We consider the problem of establishing crosslinks in a crosslink-less adjacency list for graphs that belong to a class of graphs called the linearly contractible graphs, and show that crosslinks can be established optimally in O(log n log* n) time using a CREW PRAM and optimally in O(log n) time using a CRCW PRAM for such graphs.
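
To make the data structure concrete, here is a sequential, dictionary-based sketch of what crosslinks store for a simple undirected graph. It is not the PRAM algorithm from the paper, and the function name and the index-based representation are assumptions chosen for illustration.

```python
def build_crosslinks(adj):
    """Return cross, where cross[u][i] is the index of u in adj[v], v = adj[u][i].

    adj maps each vertex to its neighbour list. The graph is assumed to be
    simple and undirected, so every edge (u, v) appears once in adj[u] and
    once in adj[v].
    """
    # Position of every directed entry (u, v) inside u's adjacency list.
    pos = {(u, v): i for u, nbrs in adj.items() for i, v in enumerate(nbrs)}
    # The crosslink of entry (u, v) points at entry (v, u).
    return {u: [pos[(v, u)] for v in nbrs] for u, nbrs in adj.items()}
```

With the links in place, moving from one entry of an edge to its twin is a single indexed access rather than a scan of the neighbour's list.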

eScholarship - University of California