369 research outputs found

    The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition

    Get PDF
    We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible framework for parallel task-composition based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate writing of parallel programs, in particular non-data parallel algorithms such as reductions.Comment: In Proceedings PLACES 2013, arXiv:1312.221

    Exploiting Graphics Processing Units for Massively Parallel Multi-Dimensional Indexing

    Get PDF
    Department of Computer EngineeringScientific applications process truly large amounts of multi-dimensional datasets. To efficiently navigate such datasets, various multi-dimensional indexing structures, such as the R-tree, have been extensively studied for the past couple of decades. Since the GPU has emerged as a new cost-effective performance accelerator, now it is common to leverage the massive parallelism of the GPU in various applications such as medical image processing, computational chemistry, and particle physics. However, hierarchical multi-dimensional indexing structures are inherently not well suited for parallel processing because their irregular memory access patterns make it difficult to exploit massive parallelism. Moreover, recursive tree traversal often fails due to the small run-time stack and cache memory in the GPU. First, we propose Massively Parallel Three-phase Scanning (MPTS) R-tree traversal algorithm to avoid the irregular memory access patterns and recursive tree traversal so that the GPU can access tree nodes in a sequential manner. The experimental study shows that MPTS R-tree traversal algorithm consistently outperforms traditional recursive R-Tree search algorithm for multi-dimensional range query processing. Next, we focus on reducing the query response time and extending n-ary multi-dimensional indexing structures - R-tree, so that a large number of GPU threads cooperate to process a single query in parallel. Because the number of submitted concurrent queries in scientific data analysis applications is relatively smaller than that of enterprise database systems and ray tracing in computer graphics. Hence, we propose a novel variant of R-trees Massively Parallel Hilbert R-Tree (MPHR-Tree), which is designed for a novel parallel tree traversal algorithm Massively Parallel Restart Scanning (MPRS). The MPRS algorithm traverses the MPHR-Tree in mostly contiguous memory access patterns without recursion, which offers more chances to optimize the parallel SIMD algorithm. Our extensive experimental results show that the MPRS algorithm outperforms the other stackless tree traversal algorithms, which are designed for efficient ray tracing in computer graphics community. Furthermore, we develop query co-processing scheme that makes use of both the CPU and GPU. In this approach, we store the internal and leaf nodes of upper tree in CPU host memory and GPU device memory, respectively. We let the CPU traverse internal nodes because the conditional branches in hierarchical tree structures often cause a serious warp divergence problem in the GPU. For leaf nodes, the GPU scans a large number of leaf nodes in parallel based on the selection ratio of a given range query. It is well known that the GPU is superior to the CPU for parallel scanning. The experimental results show that our proposed multi-dimensional range query co-processing scheme improves the query response time by up to 12x and query throughput by up to 4x compared to the state-of-the-art GPU tree traversal algorithm.ope

    Study of Capacitor & Diode Aging effects on Output Ripple in Voltage Regulators and Prognostic Detection of Failure

    Get PDF
    Objectives: To design and simulate a buck converter and detector circuit which can prognostically indicate the power supply failure. Failure of Aluminium Electrolytic Capacitor (AEC) is considered as the parameter causing the power supply failure. To analyse variation of output ripple voltage due to possible changes in the Equivalent Series Resistance (ESR) and effective capacitance of the capacitor and design a detector to detect the failure of power supply prognostically.Methods:  A DC-DC buck converter in SMPS topology is designed by assuming an input voltage of 12V with 3 volts possible fluctuations and an output voltage of 3.3 volts is desired. Simulation is carried out to measure the variation in output ripple voltage caused due to aging of electrolytic capacitor using TINA by Texas Instruments. A detector is also designed to compare the ripple voltage and a predefined threshold voltage so as to indicate the possible failure of Switched Mode Power Supply (SMPS) well in advance by monitoring the output ripple increase.Novelty:    Having a fault tolerant power supply is very important in safety critical applications. Here by monitoring the output ripple variation, the degradation of AEC is predicted by calculating the ESR and capacitance variation. This simple yet effective prognostic detection will support in the design of fault tolerant power supplies.Highlight: It is found that, the ripple at the output increases with aging of the electrolytic capacitor, as with time the equivalent capacitance decreases and Equivalent Series Resistance (ESR) of the capacitor increases. The designed detector output is found to prognostically indicate the failure of SMPS

    Architecture--Performance Interrelationship Analysis In Single/Multiple Cpu/Gpu Computing Systems: Application To Composite Process Flow Modeling

    Get PDF
    Current developments in computing have shown the advantage of using one or more Graphic Processing Units (GPU) to boost the performance of many computationally intensive applications but there are still limits to these GPU-enhanced systems. The major factors that contribute to the limitations of GPU(s) for High Performance Computing (HPC) can be categorized as hardware and software oriented in nature. Understanding how these factors affect performance is essential to develop efficient and robust applications codes that employ one or more GPU devices as powerful co-processors for HPC computational modeling. The present work analyzes and understands the intrinsic interrelationship of both hardware and software categories on computational performance for single and multiple GPU-enhanced systems using a computationally intensive application that is representative of a large portion of challenges confronting modern HPC. The representative application uses unstructured finite element computations for transient composite resin infusion process flow modeling as the computational core, characteristics and results of which reflect many other HPC applications via the sparse matrix system used for the solution of linear system of equations. This work describes these various software and hardware factors and how they interact to affect performance of computationally intensive applications enabling more efficient development and porting of High Performance Computing applications that includes current, legacy, and future large scale computational modeling applications in various engineering and scientific disciplines

    Scalable Parallel Computers for Real-Time Signal Processing

    Get PDF
    We assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing. We review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing. We evaluate the IBM SP2 at MHPCC, the Intel Paragon at SDSC, the Gray T3D at Gray Eagan Center, and the Gray T3E and ASCI TeraFLOP system proposed by Intel. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are also introducedpublished_or_final_versio

    Analysis of Various Decentralized Load Balancing Techniques with Node Duplication

    Get PDF
    Experience in parallel computing is an increasingly necessary skill for today’s upcoming computer scientists as processors are hitting a serial execution performance barrier and turning to parallel execution for continued gains. The uniprocessor system has now reached its maximum speed limit and, there is very less scope to improve the speed of such type of system. To solve this problem multiprocessor system is used, which have more than one processor. Multiprocessor system improves the speed of the system but it again faces some problems like data dependency, control dependency, resource dependency and improper load balancing. So this paper presents a detailed analysis of various decentralized load balancing techniques with node duplication to reduce the proper execution time

    Parallel Out-of-Core Sorting: The Third Way

    Get PDF
    Sorting very large datasets is a key subroutine in almost any application that is built on top of a large database. Two ways to sort out-of-core data dominate the literature: merging-based algorithms and partitioning-based algorithms. Within these two paradigms, all the programs that sort out-of-core data on a cluster rely on assumptions about the input distribution. We propose a third way of out-of-core sorting: oblivious algorithms. In all, we have developed six programs that sort out-of-core data on a cluster. The first three programs, based completely on Leighton\u27s columnsort algorithm, have a restriction on the maximum problem size that they can sort. The other three programs relax this restriction; two are based on our original algorithmic extensions to columnsort. We present experimental results to show that our algorithms perform well. To the best of our knowledge, the programs presented in this thesis are the first to sort out-of-core data on a cluster without making any simplifying assumptions about the distribution of the data to be sorted

    Framework for parallel and distributed computation

    Get PDF
    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002.Includes bibliographical references (p. 45).Existing frameworks for parallel and distributed programming either provide poor support for runtime flexibility or are overly restricted in their range of target applications. Here we describe the interface and implementation of the Avalanche framework, which attempts to solve both of these problems. In addition, we present the runtime performance of a canonical parallel algorithm written using Avalanche.by Uriel Schafer.M.Eng

    A Survey on Parallel Architecture and Parallel Programming Languages and Tools

    Get PDF
    In this paper, we have presented a brief review on the evolution of parallel computing to multi - core architecture. The survey briefs more than 45 languages, libraries and tools used till date to increase performance through parallel programming. We ha ve given emphasis more on the architecture of parallel system in the survey
    corecore