    Case Studies on Optimizing Algorithms for GPU Architectures

    Modern GPUs are complex, massively multi-threaded, and high-performance. Programmers naturally gravitate toward this high performance to obtain faster results. To do so successfully, however, programmers must first understand and then master a new set of skills – writing parallel code, using different types of parallelism, adapting to GPU architectural features, and understanding the issues that limit performance. To ease this learning process and help GPU programmers become productive more quickly, this dissertation introduces three data access skeletons (DASks) – Block, Column, and Row – and two block access skeletons (BASks) – Block-by-Block and Warp-by-Warp. Each “skeleton” provides a high-performance implementation framework that partitions data arrays into data blocks and then iterates over those blocks. The programmer must still write “body” methods that operate on individual data blocks to solve their specific problem. These skeletons provide efficient, machine-dependent data access patterns for use on GPUs. DASks group n data elements into m fixed-size data blocks, which are then partitioned across p thread blocks using a 1D or 2D layout pattern. The fixed-size data blocks are parameterized by three C++ template parameters – nWork, WarpSize, and nWarps. Generic programming techniques use these three parameters to enable performance experiments on three different types of parallelism – instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP). The DASks and BASks are introduced using a simple memory I/O (Copy) case study. A nearest neighbor search case study motivated the development of the DASks and BASks but does not itself use them. Three additional case studies – Reduce/Scan, Histogram, and Radix Sort – demonstrate DASks and BASks in action on parallel primitives and provide further performance lessons.
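
    As a rough illustration, the sketch below shows what a Block-style DASk kernel might look like. Only the template parameter names nWork, WarpSize, and nWarps come from the abstract; the kernel structure, the index arithmetic, and the blockDASk and Body names are assumptions made for illustration, not the dissertation's actual implementation.

```cuda
// Hypothetical sketch of a Block-style data access skeleton (DASk).
// Only the template parameters nWork, WarpSize, and nWarps are named in
// the abstract; everything else here is an illustrative assumption.
// Body must provide a __device__ operator()(float).
template <int nWork, int WarpSize, int nWarps, typename Body>
__global__ void blockDASk(const float* in, float* out, int n, Body body)
{
    // Elements per fixed-size data block: each of the WarpSize * nWarps
    // threads handles nWork elements (ILP within a thread).
    constexpr int BlockSize = nWork * WarpSize * nWarps;

    // 1D layout: each thread block iterates over the data blocks
    // assigned to it in a grid-stride loop.
    for (int base = blockIdx.x * BlockSize; base < n;
         base += gridDim.x * BlockSize)
    {
        #pragma unroll
        for (int k = 0; k < nWork; ++k) {
            int idx = base + k * WarpSize * nWarps + threadIdx.x;
            if (idx < n)
                out[idx] = body(in[idx]); // user-supplied "body" method
        }
    }
}
```

    Instantiated with an identity body, such a kernel reproduces the Copy case study; varying nWork and nWarps then trades instruction-level against thread-level parallelism, as in the performance experiments the abstract describes.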

    Accelerating transitive closure of large-scale sparse graphs

    Finding the transitive closure of a graph is a fundamental graph problem in which a new graph is derived that has an edge between two nodes if and only if there is a path in the original graph from one node to the other. The reachability matrix of a graph is its transitive closure. This thesis describes a novel approach that uses anti-sections to obtain the transitive closure of a graph, and examines its advantages when implemented in parallel on a GPU using the Hornet graph data structure. Graph representations of real-world systems are typically sparse because of the limited connectivity between nodes, and the anti-section approach is designed specifically to improve performance for large-scale sparse graphs. An NVIDIA Titan V GPU is used to execute the anti-section parallel implementations. The Dual-Round and Hash-Based implementations of the anti-section transitive closure approach provide a significant speedup over several parallel and sequential implementations.
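
    For reference, the sketch below shows what transitive closure computes, as a Floyd–Warshall-style boolean pass over a dense adjacency matrix. This is a generic O(V³) sequential baseline for illustration only; it is not the thesis's anti-section or Hornet-based method, and it would be impractical for the large sparse graphs targeted there.

```cuda
// Plain C++ reference (not the anti-section approach): dense
// Floyd-Warshall-style transitive closure. reach starts as the
// adjacency matrix and ends as the reachability matrix.
#include <vector>

std::vector<std::vector<bool>> transitiveClosure(
    std::vector<std::vector<bool>> reach)
{
    const size_t n = reach.size();
    for (size_t k = 0; k < n; ++k)
        for (size_t i = 0; i < n; ++i)
            if (reach[i][k])                 // a path i -> k exists...
                for (size_t j = 0; j < n; ++j)
                    if (reach[k][j])         // ...and k -> j, hence i -> j
                        reach[i][j] = true;
    return reach;
}
```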

    The observation of extended sources with the Hartebeesthoek radio telescope

    The Hartebeesthoek Radio Telescope is well suited to mapping large areas of sky at 2.3 GHz because of the stability and sensitivity of the noise-adding radiometer (Nicolson, 1970) and cryogenic amplifier used at this frequency, the relatively large 20' beam of the 26 m dish antenna, and its high-speed drive capability. Telescope control programs were written for the Observatory's online computer for automated mapping. Effort centred on removing the curved baseline, or 'background', from each Declination (Dec) scan, which arises from atmospheric and ground radiation contributions that vary as the antenna is scanned. Initially these backgrounds were measured over a wide range of Hour Angle (HA) for the Dec range of a map, and an interpolated curve was subtracted from each on-source scan according to its HA. A common base level was established by comparison with drift scans (observed with the antenna stationary). These different observations (on- and off-source Dec scans and drift scans) were combined into one in the Skymap system by performing Dec scans at a fixed starting HA for a period long enough to permit 'cold sky' and the source to drift through. A background formed by fitting a smooth curve through the lowest sample at each Dec provides a consistent relative base level for all the scans in an observation. A high scanning speed is used so that observations may fruitfully be repeated three times and interleaved to build a reliable, fully sampled map. Because each observation has its own background removed, it may be made at any HA. For comparison, maps of Upper Scorpio produced by the earlier method (Baart et al., 1980) and of the Magellanic Cloud region produced by Skymap (Mountfort et al., 1987) are shown. Skymap provides a simple and flexible mapping method that relies on the stability of the noise-adding radiometer and on high-speed repeated scans to produce good maps of large or small extent with little computation. Correction for drift is more difficult than with systems that use intersecting scans, such as the 'nodding' scans used by Haslam et al. (1981) or the Azimuth scans of Reich (1982).
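
    As a minimal sketch of the background step described above, the code below takes the lowest sample at each Dec across the repeated scans, smooths those minima, and subtracts the result from every scan. The function and variable names are hypothetical, and a simple moving average stands in for whatever smooth-curve fit Skymap actually uses.

```cuda
// Hypothetical sketch of Skymap-style background removal (plain C++).
// scans[s][d] is the sample of scan s at Declination index d; a moving
// average stands in for the smooth-curve fit described in the abstract.
#include <vector>
#include <algorithm>

void removeBackground(std::vector<std::vector<double>>& scans,
                      int halfWin = 5)
{
    if (scans.empty()) return;
    const size_t nDec = scans[0].size();

    // Lowest sample at each Dec across all scans ('cold sky' level).
    std::vector<double> lowest(nDec);
    for (size_t d = 0; d < nDec; ++d) {
        double m = scans[0][d];
        for (const auto& s : scans) m = std::min(m, s[d]);
        lowest[d] = m;
    }

    // Smooth the minima to form the background curve.
    std::vector<double> background(nDec);
    for (size_t d = 0; d < nDec; ++d) {
        const size_t lo = d > size_t(halfWin) ? d - halfWin : 0;
        const size_t hi = std::min(nDec - 1, d + size_t(halfWin));
        double sum = 0.0;
        for (size_t k = lo; k <= hi; ++k) sum += lowest[k];
        background[d] = sum / double(hi - lo + 1);
    }

    // Subtract the common background, giving all scans one base level.
    for (auto& s : scans)
        for (size_t d = 0; d < nDec; ++d)
            s[d] -= background[d];
}
```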

    LIPIcs, Volume 274, ESA 2023, Complete Volume

    Eighth Biennial Report: April 2005 – March 2007

    Preface
