A practical WSI experimental programme
At Brunel University, research has been underway for several years to assess the architectural, electrical and physical benefits and constraints of the WASP wafer-scale Associative String Processor (ASP). This is intended to implement a massively parallel processor entirely within the constraints of WSI. WASP 1 and WASP 2 were the technology demonstrators of the UK-funded Alvey programme (starting 1984), researching fundamental design methodologies for WSI. They are both examples of the Associative String Processor (ASP) architecture, developed by Brunel University. Further demonstrators are currently funded by a 3.5-year US ONR IS&T programme (starting 1987), involving further technology demonstration, applications research, and fundamental packaging and manufacturing design issues.
Design and Implementation of a Massively Parallel Version of DIRECT
This paper describes several massively parallel implementations of the global search algorithm DIRECT. Two parallel schemes take different approaches to the design challenges that DIRECT imposes through its memory requirements and data dependencies. Three design aspects, topology, data structures, and task allocation, are compared in detail. The goal is to analytically investigate the strengths and weaknesses of these parallel schemes, identify several key sources of inefficiency, and experimentally evaluate a number of improvements in the latest parallel DIRECT implementation. The performance studies demonstrate improved data structure efficiency and load balancing on a 2200-processor cluster.
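DIRECT (DIviding RECTangles) searches by evaluating the objective at box centres and subdividing promising boxes. As a rough illustration of the serial core that the paper parallelizes, here is a minimal sketch; the helper names are hypothetical and the real algorithm selects a set of "potentially optimal" boxes rather than the single best one.

```python
# Minimal sketch of DIRECT's core move: evaluate box centres, then
# trisect the most promising box along its longest dimension.
# Greatly simplified relative to the paper's parallel implementation.

def trisect_longest(lower, upper):
    """Split the box [lower, upper] into three equal boxes along its longest side."""
    widths = [u - l for l, u in zip(lower, upper)]
    d = widths.index(max(widths))          # longest dimension
    w = widths[d] / 3.0
    boxes = []
    for k in range(3):                     # three adjacent thirds
        lo, up = list(lower), list(upper)
        lo[d] = lower[d] + k * w
        up[d] = lower[d] + (k + 1) * w
        boxes.append((lo, up))
    return boxes

def centre(lower, upper):
    return [(l + u) / 2.0 for l, u in zip(lower, upper)]

def direct_step(f, boxes):
    """One (simplified) iteration: subdivide the box whose centre value is best."""
    best = min(boxes, key=lambda b: f(centre(*b)))
    boxes.remove(best)
    boxes.extend(trisect_longest(*best))
    return boxes
```

The memory pressure the paper addresses comes from the box list growing by two entries per subdivision; the data dependency comes from the selection step needing a global view of all boxes.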
Performance Modeling and Analysis of a Massively Parallel DIRECT— Part 2
Modeling and analysis techniques are used to investigate the performance of a massively parallel version of DIRECT, a global search algorithm widely used in multidisciplinary design optimization applications. Several high-dimensional benchmark functions and real-world problems are used to test the design effectiveness under various problem structures. In this second part of a two-part work, theoretical and experimental results are compared for two parallel clusters with different system scale and network connectivity. The first part studied performance sensitivity to important parameters for problem configurations and parallel schemes, using performance metrics such as memory usage, load balancing, and parallel efficiency. Here, linear regression models are used to characterize two major overhead sources, interprocessor communication and processor idleness, and are also applied to the isoefficiency functions in scalability analysis. For a variety of high-dimensional problems and large-scale systems, the massively parallel design has achieved reasonable performance. The results of the performance study provide guidance for efficient problem and scheme configuration. More importantly, the design considerations and analysis techniques generalize to the transformation of other global search algorithms into effective large-scale parallel optimization tools.
Stepwise transformation of algorithms into array processor architectures by the DECOMP
A formal approach for the transformation of computation-intensive digital signal processing algorithms into suitable array processor architectures is presented. It covers the complete design flow from algorithmic specifications in a high-level programming language to architecture descriptions in a hardware description language. The transformation itself is divided into manageable design steps and implemented in the CAD tool DECOMP, which allows the exploration of different architectures in a short time. With the presented approach, data-independent algorithms can be mapped onto array processor architectures. To allow this, a known mapping methodology for array processor design is extended to handle inhomogeneous dependence graphs with non-regular data dependences. The implementation of the formal approach in DECOMP is an important step towards design automation for massively parallel systems.
Performance Modeling and Analysis of a Massively Parallel DIRECT— Part 1
Modeling and analysis techniques are used to investigate the performance of a massively parallel version of DIRECT, a global search algorithm widely used in multidisciplinary design optimization applications. Several high-dimensional benchmark functions and real-world problems are used to test the design effectiveness under various problem structures. Theoretical and experimental results are compared for two parallel clusters with different system scale and network connectivity. The present work aims at studying the performance sensitivity to important parameters for problem configurations, parallel schemes, and system settings. The performance metrics include memory usage, load balancing, parallel efficiency, and scalability. An analytical bounding model is constructed to measure the load balancing performance under different schemes. Additionally, linear regression models are used to characterize two major overhead sources, interprocessor communication and processor idleness, and are also applied to the isoefficiency functions in scalability analysis. For a variety of high-dimensional problems and large-scale systems, the massively parallel design has achieved reasonable performance. The results of the performance study provide guidance for efficient problem and scheme configuration. More importantly, the generalized design considerations and analysis techniques are beneficial for transforming many global search algorithms into effective large-scale parallel optimization tools.
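The overhead-modelling idea in both parts can be sketched simply: measure an overhead (communication time, idle time) at several processor counts, then fit a line by ordinary least squares and extrapolate. A minimal stand-alone sketch, with synthetic numbers in place of the papers' real measurements:

```python
# Sketch of the overhead-modelling idea: fit T_overhead ~ a + b * p by
# ordinary least squares. The data below is synthetic and illustrative;
# the papers fit measured communication and idle-time overheads.

def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# processor counts and (synthetic) measured communication overhead in seconds
procs = [64, 128, 256, 512, 1024]
overhead = [0.9, 1.6, 3.1, 6.2, 12.3]    # roughly linear in p

a, b = ols_fit(procs, overhead)
predicted_2048 = a + b * 2048            # extrapolate to a larger machine
```

The same fitted growth rates feed the isoefficiency analysis: knowing how overhead grows with processor count tells you how fast the problem size must grow to hold parallel efficiency constant.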
Runtime volume visualization for parallel CFD
This paper discusses some aspects of the design of a data-distributed, massively parallel volume rendering library for runtime visualization of parallel computational fluid dynamics simulations in a message-passing environment. Unlike the traditional scheme in which visualization is a postprocessing step, the rendering is done in place on each node processor. Computational scientists who run large-scale simulations on a massively parallel computer can thus perform interactive monitoring of their simulations. The current library provides an interface to handle volume data on rectilinear grids. The same design principles can be generalized to handle other types of grids. For demonstration, we run a parallel Navier-Stokes solver making use of this rendering library on the Intel Paragon XP/S. The interactive visual response achieved is found to be very useful. Performance studies show that the parallel rendering process is scalable with the size of the simulation as well as with the size of the parallel computer.
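When each node renders its own sub-volume in place, the per-node partial images must be merged into one final image. The abstract does not state the compositing scheme, so as an illustration only, here is the standard back-to-front "over" operator used in sort-last parallel volume rendering, reduced to a single premultiplied-alpha pixel:

```python
# Illustrative sketch (not the paper's stated method): back-to-front
# "over" compositing of per-node partial RGBA pixels, as used in
# sort-last parallel volume rendering. Premultiplied alpha, one pixel.

def over(front, back):
    """Composite premultiplied-alpha RGBA pixel `front` over `back`."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    t = 1.0 - fa
    return (fr + t * br, fg + t * bg, fb + t * bb, fa + t * ba)

def composite(partials_back_to_front):
    """Fold per-node partial pixels into one final pixel, back to front."""
    result = (0.0, 0.0, 0.0, 0.0)        # start fully transparent
    for pixel in partials_back_to_front:
        result = over(pixel, result)     # each later pixel sits in front
    return result
```

In a message-passing setting the fold runs over the network, with nodes exchanging and merging partial images in depth order.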
Parallel Implementation of the PHOENIX Generalized Stellar Atmosphere Program. II: Wavelength Parallelization
We describe an important addition to the parallel implementation of our generalized NLTE stellar atmosphere and radiative transfer computer program PHOENIX. In a previous paper in this series we described data- and task-parallel algorithms we have developed for radiative transfer, spectral line opacity, and NLTE opacity and rate calculations. These algorithms divide the work spatially or by spectral lines, that is, distributing the radial zones, individual spectral lines, or characteristic rays among different processors, and employ, in addition, task parallelism for logically independent functions (such as atomic and molecular line opacities). For finite, monotonic velocity fields, the radiative transfer equation is an initial value problem in wavelength, and hence each wavelength point depends upon the previous one. However, for sophisticated NLTE models of both static and moving atmospheres needed to accurately describe, e.g., novae and supernovae, the number of wavelength points is very large (200,000--300,000), and hence parallelization over wavelength can lead both to considerable speedup in calculation time and to the ability to make use of the aggregate memory available on massively parallel supercomputers. Here, we describe an implementation of a pipelined design for the wavelength parallelization of PHOENIX, where the necessary data from the processor working on a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known. Our implementation uses a MIMD design based on a relatively small number of standard MPI library calls and is fully portable between serial and parallel computers.
Comment: AAS-TeX, 15 pages, full text with figures available at ftp://calvin.physast.uga.edu/pub/preprints/Wavelength-Parallel.ps.gz ApJ, in press
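The pipeline described above, where the processor handling wavelength point k forwards its state to the processor handling point k+1 as soon as it is known, can be mimicked with threads and queues standing in for MPI ranks and messages. A sketch under that substitution (PHOENIX itself uses MPI sends, and the "transfer solve" here is a trivial placeholder):

```python
# Thread/queue analogue of the wavelength pipeline: stage k consumes the
# state produced at wavelength point k-1 as soon as it is available and
# forwards its own result downstream. In PHOENIX this is done with MPI
# point-to-point sends between processors; queues stand in for messages.
import queue
import threading

def stage(k, upstream, downstream, results):
    """Process wavelength point k once the previous point's state arrives."""
    state = upstream.get()               # blocks until the predecessor is done
    state = state + 1                    # placeholder for the transfer solve
    results[k] = state
    downstream.put(state)                # forward immediately

n_points = 4
queues = [queue.Queue() for _ in range(n_points + 1)]
results = [None] * n_points
threads = [threading.Thread(target=stage, args=(k, queues[k], queues[k + 1], results))
           for k in range(n_points)]
for t in threads:
    t.start()
queues[0].put(0)                         # boundary condition seeds the pipeline
for t in threads:
    t.join()
# results == [1, 2, 3, 4]: each point built on the previous one
```

With hundreds of thousands of wavelength points streaming through such a pipeline, the serial dependency costs only the pipeline fill time, while the memory for line lists and opacities is spread across the machine.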
GRAPE-6: The massively-parallel special-purpose computer for astrophysical particle simulation
In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively-parallel special-purpose computer for astrophysical N-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved a theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is the simulation of collisional systems, though it can also be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are: (a) the processor chip of GRAPE-6 integrates six force-calculation pipelines, compared to the one pipeline of GRAPE-4 (which needed 3 clock cycles to calculate one interaction); (b) the clock speed is increased from 32 to 90 MHz; and (c) the total number of processor chips is increased from 1728 to 2048. These improvements resulted in a peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6.
Comment: Accepted for publication in PASJ, scheduled to appear in Vol. 55, No.
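The force-calculation pipelines mentioned above evaluate, in hardware, the softened direct-summation kernel of collisional N-body dynamics. A plain-Python sketch of that O(N²) kernel, for orientation only (GRAPE pipelines this arithmetic at chip level, with G = 1 units and a softening length eps assumed here):

```python
# The O(N^2) kernel that GRAPE-style pipelines implement in hardware:
# softened direct summation of pairwise gravitational accelerations.
# Plain Python, G = 1 units; purely illustrative.

def accelerations(pos, mass, eps=1e-3):
    """Return the acceleration vector on each particle (softening eps)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps ** 2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc
```

Each (i, j) interaction is one trip through a pipeline; six pipelines per chip at 90 MHz, across 2048 chips, is what turns this quadratic loop into a 64 Tflops machine.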
The language parallel Pascal and other aspects of the massively parallel processor
A high-level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details of the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator that completes the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed, with a description of fault-tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed, and a survey of other high-level languages is given.
A Specialized Processor for Track Reconstruction at the LHC Crossing Rate
We present the results of an R&D study of a specialized processor capable of precisely reconstructing events with hundreds of charged-particle tracks in pixel detectors at 40 MHz, thus suitable for processing LHC events at the full crossing frequency. For this purpose we design and test a massively parallel pattern-recognition algorithm, inspired by studies of the processing of visual images by the brain as it happens in nature. We find that high-quality tracking in large detectors is possible with sub-μs latencies when this algorithm is implemented in modern, high-speed, high-bandwidth FPGA devices. This opens the possibility of making track reconstruction happen transparently as part of the detector readout.
Comment: Presented by G. Punzi at the conference on "Instrumentation for Colliding Beam Physics" (INSTR14), 24 Feb to 1 Mar 2014, Novosibirsk, Russia. Submitted to JINST proceedings.