2,311 research outputs found

Deterministic 1-k routing on meshes with applications to worm-hole routing

Authors: Kaufmann M.; Sibeyn J.
Publication venue: Max-Planck-Institut für Informatik
Publication date: 01/01/1993

In $1$-$k$ routing each of the $n^2$ processing units of an $n \times n$ mesh connected computer initially holds $1$ packet which must be routed such that any processor is the destination of at most $k$ packets. This problem reflects the practical desire for routing better than the popular routing of permutations. $1$-$k$ routing also has implications for hot-potato worm-hole routing, which is of great importance for real world systems. We present a near-optimal deterministic algorithm running in $\sqrt{k} \cdot n / 2 + {\cal O}(n)$ steps. We give a second algorithm with slightly worse routing time but working queue size three. Applying this algorithm considerably reduces the routing time of hot-potato worm-hole routing. Non-trivial extensions are given to the general $l$-$k$ routing problem and for routing on higher dimensional meshes. Finally we show that $k$-$k$ routing can be performed in ${\cal O}(k \cdot n)$ steps with working queue size four. Hereby the hot-potato worm-hole routing problem can be solved in ${\cal O}(k^{3/2} \cdot n)$ steps.
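
The problem definition in the abstract above can be made concrete with a minimal sketch. It only checks whether a set of packets forms a valid 1-k routing instance and evaluates the leading term of the stated running time. It does not implement the routing algorithm itself, which the abstract does not describe. The names `is_one_k_instance` and `leading_term` and the dictionary representation of destinations are assumptions made for illustration.

```python
import math
from collections import Counter


def is_one_k_instance(n, dest, k):
    """Check the 1-k condition on an n x n mesh (illustrative helper).

    dest maps each source processor (row, col) to the destination
    (row, col) of the single packet it initially holds.
    """
    cells = {(r, c) for r in range(n) for c in range(n)}
    # Every processor must hold exactly one packet.
    if set(dest) != cells:
        return False
    # Every destination must lie on the mesh.
    if any(d not in cells for d in dest.values()):
        return False
    # No processor may be the destination of more than k packets.
    load = Counter(dest.values())
    return max(load.values()) <= k


def leading_term(n, k):
    """Leading term sqrt(k) * n / 2 of the stated bound (the O(n) term is ignored)."""
    return math.sqrt(k) * n / 2


# A permutation is the special case k = 1.
identity = {(r, c): (r, c) for r in range(2) for c in range(2)}
# All four packets of a 2 x 2 mesh target the same processor.
hot_spot = {(r, c): (0, 0) for r in range(2) for c in range(2)}
```

For example, `identity` is a valid instance for `k = 1`, while `hot_spot` is valid only for `k >= 4`.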

A Comparison of Meshes With Static Buses and Unidirectional Wrap-Arounds

Authors: Krizanc Danny; Rajasekaran Sanguthevar; Shende Sunil M.
Publication venue: ScholarlyCommons
Publication date: 01/07/1992

We investigate the relative computational powers of a mesh with static buses and a mesh with unidirectional wraparounds. A mesh with unidirectional wraparounds is a torus with the restriction that any wraparound link of the architecture can only transmit data in one of the two directions at any clock tick. We show that the problem of packet routing can be solved as efficiently on a linear array with a unidirectional wraparound link as on a linear array with a broadcast bus. We also present a routing algorithm for a two-dimensional torus with unidirectional wraparound links whose run time is close to that of the best known algorithm for routing on a mesh with broadcast buses in each dimension. In addition, we show that on a mesh with broadcast buses, sorting can be done in time that is essentially the same as the time needed for packet routing.

Towards practical permutation routing on meshes

Authors: Kaufmann M.; Meyer U.; Sibeyn J.
Publication venue: Max-Planck-Institut für Informatik
Publication date: 01/01/1994

We consider the permutation routing problem on two-dimensional $n \times n$ meshes. To be practical, a routing algorithm is required to ensure very small queue sizes $Q$, and very low running time $T$, not only asymptotically but particularly also for the practically important $n$ up to $1000$. With a technique inspired by a scheme of Kaklamanis/Krizanc/Rao, we obtain a near-optimal result: $T = 2 \cdot n + {\cal O}(1)$ with $Q = 2$. Although $Q$ is very attractive now, the lower order terms in $T$ make this algorithm highly impractical. Therefore we present simple schemes which are asymptotically slower, but have $T$ around $3 \cdot n$ for all $n$ and $Q$ between 2 and 8.

Doctor of Philosophy

Author: King James Sokhom
Publication venue: University of Utah
Publication date: 01/01/2017

Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lockstep may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile-time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing.

It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
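
The idea of in-place updates through explicit memory segmentation can be illustrated with a small sketch. This is not the DCSR format, whose layout the abstract does not specify. It is a simplified, assumed variant of compressed sparse row storage in which every row owns a fixed-size segment with spare slots, so a new entry can be written without rebuilding or sorting the matrix. The class name `SlackCSR` and the `slack` parameter are hypothetical.

```python
class SlackCSR:
    """CSR-like storage with a fixed segment of `slack` slots per row (illustrative only)."""

    def __init__(self, n_rows, slack=4):
        self.slack = slack
        # Row r owns slots [r * slack, (r + 1) * slack).
        self.row_start = [r * slack for r in range(n_rows)]
        self.row_len = [0] * n_rows
        self.cols = [None] * (n_rows * slack)
        self.vals = [0.0] * (n_rows * slack)

    def insert(self, row, col, val):
        """Add val at (row, col) in place, accumulating if the entry already exists."""
        start, length = self.row_start[row], self.row_len[row]
        for i in range(start, start + length):
            if self.cols[i] == col:
                self.vals[i] += val
                return
        if length == self.slack:
            # A real dynamic format would allocate another segment here.
            raise OverflowError("row segment is full")
        self.cols[start + length] = col
        self.vals[start + length] = val
        self.row_len[row] += 1

    def matvec(self, x):
        """Compute y = A x, visiting only the occupied slots of each row."""
        y = [0.0] * len(self.row_start)
        for r, (start, length) in enumerate(zip(self.row_start, self.row_len)):
            for i in range(start, start + length):
                y[r] += self.vals[i] * x[self.cols[i]]
        return y


m = SlackCSR(n_rows=2, slack=2)
m.insert(0, 0, 1.0)
m.insert(0, 1, 2.0)
m.insert(1, 1, 3.0)
m.insert(0, 0, 1.0)  # streaming update to an existing entry
y = m.matvec([1.0, 1.0])
```

The trade-off in this sketch is wasted space in sparse rows against constant-cost appends. Entries within a row are left unsorted, so lookups scan the row segment.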