2,311 research outputs found

    Deterministic 1-k routing on meshes with applications to worm-hole routing

    Get PDF
    In 11-kk routing each of the n2n^2 processing units of an n×nn \times n mesh connected computer initially holds 11 packet which must be routed such that any processor is the destination of at most kk packets. This problem reflects practical desire for routing better than the popular routing of permutations. 11-kk routing also has implications for hot-potato worm-hole routing, which is of great importance for real world systems. We present a near-optimal deterministic algorithm running in \sqrt{k} \cdot n / 2 + \go{n} steps. We give a second algorithm with slightly worse routing time but working queue size three. Applying this algorithm considerably reduces the routing time of hot-potato worm-hole routing. Non-trivial extensions are given to the general ll-kk routing problem and for routing on higher dimensional meshes. Finally we show that kk-kk routing can be performed in \go{k \cdot n} steps with working queue size four. Hereby the hot-potato worm-hole routing problem can be solved in \go{k^{3/2} \cdot n} steps

    A Comparison of Meshes With Static Buses and Unidirectional Wrap-Arounds

    Get PDF
    We investigate the relative computational powers of a mesh with static buses and a mesh with unidirectional wrap-mounds. A mesh with unidirectional wraparounds is a torus with the restriction that any wraparoundlink of the architecture can only transmit data in one of the two directions at any clock tick. We show that the problem of packet routing can be solved as efficiently on a linear array with unidirectional wrap-around link as on a linear array with a broadcast bus. We also present a routing algorithm for a twcdimensional torus with unidirectional wraparound links whose run time is close to that of the best known algorithm for routing on a mesh with broadcast buses in each dimension. In addition, we show that on a mesh with broadcast buses, sorting can be done in time that is essentially the same as the time needed for packet routing

    Towards practical permutation routing on meshes

    Get PDF
    We consider the permutation routing problem on two-dimensional n×nn \times n meshes. To be practical, a routing algorithm is required to ensure very small queue sizes QQ, and very low running time TT, not only asymptotically but particularly also for the practically important nn up to 10001000. With a technique inspired by a scheme of Kaklamanis/Krizanc/Rao, we obtain a near-optimal result: T=2⋅n+O(1)T = 2 \cdot n + {\cal O}(1) with Q=2Q = 2. Although QQ is very attractive now, the lower order terms in TT make this algorithm highly impractical. Therefore we present simple schemes which are asymptotically slower, but have TT around 3⋅n3 \cdot n for {\em all} nn and QQ between 2 and 8

    Doctor of Philosophy

    Get PDF
    dissertationMemory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented
    • …
    corecore