CRBLASTER: A Parallel-Processing Computational Framework for Embarrassingly-Parallel Image-Analysis Algorithms
The development of parallel-processing image-analysis codes is generally a
challenging task that requires complicated choreography of interprocessor
communications. If, however, the image-analysis algorithm is embarrassingly
parallel, then the development of a parallel-processing implementation of that
algorithm can be a much easier task to accomplish because, by definition, there
is little need for communication between the compute processes. I describe the
design, implementation, and performance of a parallel-processing image-analysis
application, called CRBLASTER, which does cosmic-ray rejection of CCD
(charge-coupled device) images using the embarrassingly-parallel L.A.COSMIC
algorithm. CRBLASTER is written in C using the high-performance computing
industry standard Message Passing Interface (MPI) library. The code has been
designed to be used by research scientists who are familiar with C as a
parallel-processing computational framework that enables the easy development
of parallel-processing image-analysis programs based on embarrassingly-parallel
algorithms. The CRBLASTER source code is freely available at the official
application website at the National Optical Astronomy Observatory. Removing
cosmic rays from a single 800x800 pixel Hubble Space Telescope WFPC2 image
takes 44 seconds with the IRAF script lacos_im.cl running on a single core of
an Apple Mac Pro computer with two 2.8-GHz quad-core Intel Xeon processors.
CRBLASTER is 7.4 times faster processing the same image on a single core on the
same machine. Processing the same image with CRBLASTER simultaneously on all 8
cores of the same machine takes 0.875 seconds, which is a speedup factor of
50.3 over the IRAF script. A detailed analysis is presented of the
performance of CRBLASTER using between 1 and 57 processors on a low-power
Tilera 700-MHz 64-core TILE64 processor.
Comment: 8 pages, 2 figures, 1 table, accepted for publication in PAS
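The decomposition that makes codes like CRBLASTER embarrassingly parallel, splitting an image into sub-images that are processed independently and then reassembled, can be sketched as follows. This is a hypothetical Python stand-in for the C/MPI implementation, and `clean_tile` applies a trivial median filter rather than the L.A.COSMIC algorithm:

```python
# Sketch of an embarrassingly-parallel image-processing decomposition:
# split the image into horizontal strips, process each strip independently
# (no interprocess communication), and stitch the results back together.
# Hypothetical stand-in for the C/MPI code; clean_tile is a toy 1-D median
# filter, not the L.A.COSMIC cosmic-ray rejection algorithm.
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def clean_tile(tile):
    """Replace each pixel by the median of its row neighbourhood (toy filter)."""
    cleaned = []
    for row in tile:
        cleaned.append([
            median(row[max(0, j - 1): j + 2]) for j in range(len(row))
        ])
    return cleaned

def split_rows(image, n_tiles):
    """Partition the image into n_tiles horizontal strips."""
    step = (len(image) + n_tiles - 1) // n_tiles
    return [image[i:i + step] for i in range(0, len(image), step)]

def process_image(image, n_workers=4):
    tiles = split_rows(image, n_workers)
    # Each strip is independent, so a plain parallel map suffices.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        cleaned = list(pool.map(clean_tile, tiles))
    return [row for tile in cleaned for row in tile]  # reassemble

image = [[float((i * 7 + j * 3) % 10) for j in range(8)] for i in range(8)]
result = process_image(image)
```

Note that a real cosmic-ray rejector whose filter looks at 2-D neighbourhoods would need each strip to carry a few rows of overlap with its neighbours; the toy row-wise filter above sidesteps that.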
On Probabilistic Parallel Programs with Process Creation and Synchronisation
We initiate the study of probabilistic parallel programs with dynamic process
creation and synchronisation. To this end, we introduce probabilistic
split-join systems (pSJSs), a model for parallel programs, generalising both
probabilistic pushdown systems (a model for sequential probabilistic procedural
programs which is equivalent to recursive Markov chains) and stochastic
branching processes (a classical mathematical model with applications in
various areas such as biology, physics, and language processing). Our pSJS
model allows for a possibly recursive spawning of parallel processes; the
spawned processes can synchronise and return values. We study the basic
performance measures of pSJSs, especially the distribution and expectation of
space, work and time. Our results extend and improve previously known results
on the subsumed models. We also show how to do performance analysis in
practice, and present two case studies illustrating the modelling power of
pSJSs.
Comment: This is a technical report accompanying a TACAS'11 paper
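As a toy illustration of the kind of performance measure studied for such models (not the paper's actual analysis), the expected total work of a subcritical branching process, in which each process spawns on average m < 1 children, is the geometric series 1/(1 - m). A short Python check, with made-up function names:

```python
# Toy illustration (not the pSJS analysis from the paper): in a subcritical
# branching process where each process spawns on average m < 1 children,
# the expected total number of processes ever created (the "work"),
# starting from one root, is sum_{k>=0} m^k = 1 / (1 - m).

def expected_work(m, generations=200):
    """Expected population summed over generations (truncated series)."""
    total, gen_mean = 0.0, 1.0   # generation 0 holds the single root process
    for _ in range(generations):
        total += gen_mean
        gen_mean *= m            # expected size of the next generation
    return total

def expected_work_closed_form(m):
    return 1.0 / (1.0 - m)

# With m = 0.5 the truncated series converges to the closed form 2.0.
work = expected_work(0.5)
```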
Recommended from our members
Improving parallel program performance using critical path analysis
A programming tool that performs analysis of critical paths for parallel programs has been developed. This tool determines the critical path for the program as scheduled onto a parallel computer with P processing elements, the critical path for the program expressed as a data flow graph (when maximal parallelism can be expressed), and the minimum number of processing elements (P_opt) needed to obtain maximum program speedup. Experiments were performed using several versions of a Gaussian elimination program to examine how speedup varied with changes in granularity and critical path length. These experiments showed that when the available number of processing elements P < P_opt, increasing granularity improved program speedup more than reducing (the data flow graph's) critical path length, whereas when P ≥ P_opt, increasing granularity degraded program speedup while reducing critical path length improved program speedup.
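The quantities involved can be illustrated with a generic longest-path computation over a task data-flow graph (a sketch, not the tool described above; task names and durations are made up):

```python
# Generic critical-path computation for a task DAG (illustrative only).
# Each task has an execution time; edges are data dependencies. The
# critical path is the longest weighted path through the graph, which
# lower-bounds parallel execution time on any number of processors.
from graphlib import TopologicalSorter

def critical_path(times, deps):
    """times: {task: duration}; deps: {task: set of prerequisite tasks}."""
    finish = {}  # earliest possible finish time of each task
    for task in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps.get(task, ())), default=0.0)
        finish[task] = start + times[task]
    return max(finish.values())

# Hypothetical four-task dependence structure.
times = {"a": 2.0, "b": 1.0, "c": 3.0, "d": 1.0}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
span = critical_path(times, deps)   # longest path a -> c -> d = 2 + 3 + 1
total_work = sum(times.values())    # serial time on one processor
```

The ratio total_work / span bounds the achievable speedup, and comparing the two is one way to see whether adding processors beyond some P_opt can still pay off.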
PETRI NET BASED MODELING OF PARALLEL PROGRAMS EXECUTING ON DISTRIBUTED MEMORY MULTIPROCESSOR SYSTEMS
The development of parallel programs following the paradigm of communicating sequential processes to be executed on distributed memory multiprocessor systems is addressed. The key issue in programming parallel machines today is to provide computerized tools supporting the development of efficient parallel software, i.e. software effectively harnessing the power of parallel processing systems. The critical situations where a parallel programmer needs help are in expressing a parallel algorithm in a programming language, in getting a parallel program to work, and in tuning it to get optimum performance (for example, speedup).
We show that the Petri net formalism is highly suitable as a performance modeling technique for asynchronous parallel systems, by introducing a model that captures the influences of the parallel program, the parallel architecture, and the mapping on overall system performance. PRM-net (Program-Resource-Mapping) models combine a Petri net model of the multiple flows of control in a parallel program, a Petri net model of the parallel hardware, and the process-to-processor mapping information into a single integrated performance model. Automated analysis of PRM-net models addresses correctness and performance of parallel programs mapped to parallel hardware. Questions about the correctness of parallel programs can be answered by investigating behavioural properties of Petri net programs such as liveness, reachability, boundedness, mutual exclusion, etc. Performance of parallel programs is usefully considered only in connection with a dedicated target hardware. For this reason it is essential to integrate multiprocessor hardware characteristics into the specification of a parallel program. The integration is done by assigning the concurrent processes to physical processing devices and the communication patterns among parallel processes to the communication media connecting the processing elements, yielding an integrated, Petri net based performance model. Evaluation of the integrated model applies simulation and Markovian analysis to derive expressions characterising the performance of the program being developed.
Synthesis and decomposition rules for hierarchical models naturally give rise to using PRM-net models for graphical, performance-oriented parallel programming, supporting top-down (stepwise refinement) as well as bottom-up development approaches. The graphical representation of Petri net programs visualizes phenomena like parallelism, synchronisation, communication, and sequential and alternative execution. Modularity of program blocks aids reusability; prototyping is promoted by automated code generation on the basis of high-level program specifications.
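The place/transition semantics underlying such models can be sketched with a minimal token-game interpreter (a generic Petri net, not a PRM-net; PRM-nets layer timing, hardware resources, and mapping information on top of this):

```python
# Minimal place/transition Petri net interpreter (illustrative only).
# A transition is a pair (input places, output places). It is enabled when
# every input place holds at least one token; firing consumes one token per
# input place and produces one token per output place.

def enabled(marking, transition):
    inputs, _ = transition
    return all(marking.get(p, 0) >= 1 for p in inputs)

def fire(marking, transition):
    inputs, outputs = transition
    new = dict(marking)
    for p in inputs:
        new[p] -= 1
    for p in outputs:
        new[p] = new.get(p, 0) + 1
    return new

# Hypothetical net: a fork transition splits one flow of control into two
# parallel branches; a join transition synchronises them again.
fork = (["start"], ["left", "right"])
join = (["left", "right"], ["done"])

m0 = {"start": 1}
m1 = fire(m0, fork)   # both branches now hold a token
m2 = fire(m1, join)   # the join needs a token from each branch
```

Behavioural questions like the reachability and boundedness mentioned above amount to exploring the markings generated by repeated `fire` steps.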
The Paradigm Compiler: mapping a functional language for the Connection Machine
The Paradigm Compiler implements a new approach to compiling programs written in high level languages for execution on highly parallel computers. The general approach is to identify the principal data structures constructed by the program and to map these structures onto the processing elements of the target machine. The mapping is chosen to maximize performance as determined through compile time global analysis of the source program. The source language is Sisal, a functional language designed for scientific computations, and the target language is Paris, the published low level interface to the Connection Machine. The data structures considered are multidimensional arrays whose dimensions are known at compile time. Computations that build such arrays usually offer opportunities for highly parallel execution; they are data parallel. The Connection Machine is an attractive target for these computations, and the parallel for construct of the Sisal language is a convenient high level notation for data parallel algorithms. The principles and organization of the Paradigm Compiler are discussed.
Three-dimensional boundary layer analysis program Blay and its application
The boundary layer calculation program (BLAY) is a program code which accurately analyzes the three-dimensional boundary layer of a wing with an undefined plane. In comparison with other preexisting programs, BLAY is characterized by the following: (1) the time required for computation is shorter than for any of the others; (2) the program is adaptable to a parallel processing computer; and (3) the program has second-order accuracy in the z-direction. As a boundary layer modification to transonic nonviscous flow analysis programs, it is used to treat viscous and nonviscous interference problems iteratively. Its efficiency is an important factor in cost reduction in aircraft design.
Towards Parallel Programming Models for Predictability
Future embedded systems for performance-demanding applications will be massively parallel. High performance tasks will be parallel programs, running on several cores, rather than single threads running on single cores. For hard real-time applications, WCETs for such tasks must be bounded. Low-level parallel programming models, based on concurrent threads, are notoriously hard to use due to their inherent nondeterminism. Therefore the parallel processing community has long considered high-level parallel programming models, which restrict the low-level models to regain determinism. In this position paper we argue that such parallel programming models are beneficial also for WCET analysis of parallel programs. We review some proposed models, and discuss their influence on timing predictability. In particular we identify data parallel programming as a suitable paradigm as it is deterministic and allows current methods for WCET analysis to be extended to parallel code. GPUs are increasingly used for high performance applications: we discuss a current GPU architecture, and we argue that it offers a parallel platform for compute-intensive applications for which it seems possible to construct precise timing models. Thus, a promising route for the future is to develop WCET analyses for data-parallel software running on GPUs.
PyPele Rewritten To Use MPI
A computer program known as PyPele, originally written as a Python-language extension module of a C++ language program, has been rewritten in pure Python. The original version of PyPele dispatches and coordinates parallel-processing tasks on cluster computers and provides a conceptual framework for spacecraft-mission-design and -analysis software tools to run in an embarrassingly parallel mode. The original version of PyPele uses SSH (Secure Shell, a set of standards and an associated network protocol for establishing a secure channel between a local and a remote computer) to coordinate parallel processing. Instead of SSH, the present Python version of PyPele uses the Message Passing Interface (MPI) [an unofficial de-facto standard language-independent application programming interface for message passing on a parallel computer] while keeping the same user interface. The use of MPI instead of SSH and the preservation of the original PyPele user interface make it possible for parallel application programs written previously for the original version of PyPele to run on MPI-based cluster computers. As a result, engineers using the previously written application programs can take advantage of embarrassing parallelism without needing to rewrite those programs.
Parallel programming in biomedical signal processing
Dissertation submitted for the degree of Master in Biomedical Engineering.
Patients with neuromuscular and cardiorespiratory diseases need to be monitored continuously. This constant monitoring gives rise to huge amounts of multivariate data which need to be processed as soon as possible, so that their most relevant features can be extracted.
The field of parallel processing, an area of the computational sciences, comes naturally as a way to provide an answer to this problem. For parallel processing to succeed it is necessary to adapt the pre-existing signal processing algorithms to the modern architectures of computer systems with several processing units.
In this work parallel processing techniques are applied to biosignals, connecting the area of computer science to the biomedical domain. Several considerations are made on how to design parallel algorithms for signal processing, following the data parallel paradigm. The emphasis is given to algorithm design, rather than the computing systems that execute these algorithms. Nonetheless, shared memory systems and distributed memory systems are mentioned in the present work.
Two signal processing tools integrating some of the parallel programming concepts mentioned throughout this work were developed. These tools allow a fast and efficient analysis of long-term biosignals. The two kinds of analysis focus on heart rate variability and breathing frequency, and aim at the processing of electrocardiograms and respiratory signals, respectively.
The proposed tools make use of the several processing units that most modern computers include in their architecture, giving the clinician a fast tool without the need to set up a system specifically meant to run parallel programs.
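The data-parallel style described above can be illustrated with a generic sketch (not the dissertation's tools; the SDNN measure over RR intervals is a standard heart-rate-variability statistic, but the function names and chunking scheme here are made up):

```python
# Data-parallel sketch of a heart-rate-variability measure (illustrative,
# not the dissertation's tools). A long record of RR intervals (seconds
# between successive heartbeats) is split into chunks; per-chunk partial
# sums are computed independently in parallel and then combined into the
# global SDNN (standard deviation of the RR intervals).
from concurrent.futures import ThreadPoolExecutor
from math import sqrt

def partial_sums(chunk):
    """Independent per-chunk statistics: count, sum, sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def sdnn(rr_intervals, n_workers=4):
    step = max(1, len(rr_intervals) // n_workers)
    chunks = [rr_intervals[i:i + step]
              for i in range(0, len(rr_intervals), step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(partial_sums, chunks)   # data-parallel phase
    n = s = ss = 0.0
    for cn, cs, css in parts:                    # cheap sequential reduce
        n, s, ss = n + cn, s + cs, ss + css
    mean = s / n
    return sqrt(ss / n - mean * mean)  # population standard deviation

rr = [0.8, 0.82, 0.79, 0.85, 0.81, 0.83, 0.8, 0.84] * 100  # synthetic data
hrv = sdnn(rr)
```

Because each chunk contributes only count/sum/sum-of-squares, the per-chunk work dominates and the final reduction stays negligible, which is what makes long-term recordings amenable to this split.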