19,184 research outputs found

    Design and Implementation of a Massively Parallel Version of DIRECT

    Get PDF
    This paper describes several massively parallel implementations for a global search algorithm DIRECT. Two parallel schemes take different approaches to address DIRECT's design challenges imposed by memory requirements and data dependency. Three design aspects in topology, data structures, and task allocation are compared in detail. The goal is to analytically investigate the strengths and weaknesses of these parallel schemes, identify several key sources of inefficiency, and experimentally evaluate a number of improvements in the latest parallel DIRECT implementation. The performance studies demonstrate improved data structure efficiency and load balancing on a 2200 processor cluster

    Performance Modeling and Analysis of a Massively Parallel DIRECT— Part 2

    Get PDF
    Modeling and analysis techniques are used to investigate the performance of a massively parallel version of DIRECT, a global search algorithm widely used in multidisciplinary design optimization applications. Several highdimensional benchmark functions and real world problems are used to test the design effectiveness under various problem structures. In this second part of a twopart work, theoretical and experimental results are compared for two parallel clusters with different system scale and network connectivity. The first part studied performance sensitivity to important parameters for problem configurations and parallel schemes, using performance metrics such as memory usage, load balancing, and parallel efficiency. Here linear regression models are used to characterize two major overhead sources—interprocessor communication and processor idleness—and also applied to the isoefficiency functions in scalability analysis. For a variety of highdimensional problems and large scale systems, the massively parallel design has achieved reasonable performance. The results of the performance study provide guidance for efficient problem and scheme configuration. More importantly, the design considerations and analysis techniques generalize to the transformation of other global search algorithms into effective large scale parallel optimization tools

    Stepwise transformation of algorithms into array processor architectures by the decomp

    Get PDF
    A formal approach for the transformation of computation intensive digital signal processing algorithms into suitable array processor architectures is presented. It covers the complete design flow from algorithmic specifications in a high-level programming language to architecture descriptions in a hardware description language. The transformation itself is divided into manageable design steps and implemented in the CAD-tool DECOMP which allows the exploration of different architectures in a short time. With the presented approach data independent algorithms can be mapped onto array processor architectures. To allow this, a known mapping methodology for array processor design is extended to handle inhomogeneous dependence graphs with nonregular data dependences. The implementation of the formal approach in the DECOMP is an important step towards design automation for massively parallel systems

    Performance Modeling and Analysis of a Massively Parallel DIRECT— Part 1

    Get PDF
    Modeling and analysis techniques are used to investigate the performance of a massively parallel version of DIRECT, a global search algorithm widely used in multidisciplinary design optimization applications. Several highdimensional benchmark functions and real world problems are used to test the design effectiveness under various problem structures. Theoretical and experimental results are compared for two parallel clusters with different system scale and network connectivity. The present work aims at studying the performance sensitivity to important parameters for problem configurations, parallel schemes, and system settings. The performance metrics include the memory usage, load balancing, parallel efficiency, and scalability. An analytical bounding model is constructed to measure the load balancing performance under different schemes. Additionally, linear regression models are used to characterize two major overhead sources—interprocessor communication and processor idleness, and also applied to the isoefficiency functions in scalability analysis. For a variety of highdimensional problems and large scale systems, the massively parallel design has achieved reasonable performance. The results of the performance study provide guidance for efficient problem and scheme configuration. More importantly, the generalized design considerations and analysis techniques are beneficial for transforming many global search algorithms to become effective large scale parallel optimization tools

    Runtime volume visualization for parallel CFD

    Get PDF
    This paper discusses some aspects of design of a data distributed, massively parallel volume rendering library for runtime visualization of parallel computational fluid dynamics simulations in a message-passing environment. Unlike the traditional scheme in which visualization is a postprocessing step, the rendering is done in place on each node processor. Computational scientists who run large-scale simulations on a massively parallel computer can thus perform interactive monitoring of their simulations. The current library provides an interface to handle volume data on rectilinear grids. The same design principles can be generalized to handle other types of grids. For demonstration, we run a parallel Navier-Stokes solver making use of this rendering library on the Intel Paragon XP/S. The interactive visual response achieved is found to be very useful. Performance studies show that the parallel rendering process is scalable with the size of the simulation as well as with the parallel computer

    Parallel Implementation of the PHOENIX Generalized Stellar Atmosphere Program. II: Wavelength Parallelization

    Get PDF
    We describe an important addition to the parallel implementation of our generalized NLTE stellar atmosphere and radiative transfer computer program PHOENIX. In a previous paper in this series we described data and task parallel algorithms we have developed for radiative transfer, spectral line opacity, and NLTE opacity and rate calculations. These algorithms divided the work spatially or by spectral lines, that is distributing the radial zones, individual spectral lines, or characteristic rays among different processors and employ, in addition task parallelism for logically independent functions (such as atomic and molecular line opacities). For finite, monotonic velocity fields, the radiative transfer equation is an initial value problem in wavelength, and hence each wavelength point depends upon the previous one. However, for sophisticated NLTE models of both static and moving atmospheres needed to accurately describe, e.g., novae and supernovae, the number of wavelength points is very large (200,000--300,000) and hence parallelization over wavelength can lead both to considerable speedup in calculation time and the ability to make use of the aggregate memory available on massively parallel supercomputers. Here, we describe an implementation of a pipelined design for the wavelength parallelization of PHOENIX, where the necessary data from the processor working on a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known. Our implementation uses a MIMD design based on a relatively small number of standard MPI library calls and is fully portable between serial and parallel computers.Comment: AAS-TeX, 15 pages, full text with figures available at ftp://calvin.physast.uga.edu/pub/preprints/Wavelength-Parallel.ps.gz ApJ, in pres

    GRAPE-6: The massively-parallel special-purpose computer for astrophysical particle simulation

    Full text link
    In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively-parallel special-purpose computer for astrophysical NN-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is simulation of collisional systems, though it can be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are (a) The processor chip of GRAPE-6 integrates 6 force-calculation pipelines, compared to one pipeline of GRAPE-4 (which needed 3 clock cycles to calculate one interaction), (b) the clock speed is increased from 32 to 90 MHz, and (c) the total number of processor chips is increased from 1728 to 2048. These improvements resulted in the peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6.Comment: Accepted for publication in PASJ, scheduled to appear in Vol. 55, No.

    The language parallel Pascal and other aspects of the massively parallel processor

    Get PDF
    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given

    A Specialized Processor for Track Reconstruction at the LHC Crossing Rate

    Full text link
    We present the results of an R&D study of a specialized processor capable of precisely reconstructing events with hundreds of charged-particle tracks in pixel detectors at 40 MHz, thus suitable for processing LHC events at the full crossing frequency. For this purpose we design and test a massively parallel pattern-recognition algorithm, inspired by studies of the processing of visual images by the brain as it happens in nature. We find that high-quality tracking in large detectors is possible with sub-ÎĽ\mus latencies when this algorithm is implemented in modern, high-speed, high-bandwidth FPGA devices. This opens a possibility of making track reconstruction happen transparently as part of the detector readout.Comment: Presented by G.Punzi at the conference on "Instrumentation for Colliding Beam Physics" (INSTR14), 24 Feb to 1 Mar 2014, Novosibirsk, Russia. Submitted to JINST proceeding
    • …
    corecore