47 research outputs found
StochKit-FF: Efficient Systems Biology on Multicore Architectures
The stochastic modelling of biological systems is an informative, and in some
cases, very adequate technique, which may however result in being more
expensive than other modelling approaches, such as differential equations. We
present StochKit-FF, a parallel version of StochKit, a reference toolkit for
stochastic simulations. StochKit-FF is based on the FastFlow programming
toolkit for multicores and exploits the novel concept of selective memory. We
experiment StochKit-FF on a model of HIV infection dynamics, with the aim of
extracting information from efficiently run experiments, here in terms of
average and variance and, on a longer term, of more structured data.Comment: 14 pages + cover pag
Optimal processor assignment for pipeline computations
The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual responses times for different processor sizes, find an assignment of processor to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np2) algorithm is provided that finds the optimal assignment for the response time optimization problem; it was found that the assignment optimizing the constrained throughput in O(np2log p) time. Special cases of linear, independent, and tree graphs are also considered
Parallel Algorithms for Time and Frequency Domain Circuit Simulation
As a most critical form of pre-silicon verification, transistor-level circuit simulation
is an indispensable step before committing to an expensive manufacturing process.
However, considering the nature of circuit simulation, it can be computationally
expensive, especially for ever-larger transistor circuits with more complex device models.
Therefore, it is becoming increasingly desirable to accelerate circuit simulation.
On the other hand, the emergence of multi-core machines offers a promising solution
to circuit simulation besides the known application of distributed-memory clustered
computing platforms, which provides abundant hardware computing resources. This
research addresses the limitations of traditional serial circuit simulations and proposes
new techniques for both time-domain and frequency-domain parallel circuit
simulations.
For time-domain simulation, this dissertation presents a parallel transient simulation
methodology. This new approach, called WavePipe, exploits coarse-grained
application-level parallelism by simultaneously computing circuit solutions at multiple
adjacent time points in a way resembling hardware pipelining. There are two
embodiments in WavePipe: backward and forward pipelining schemes. While the
former creates independent computing tasks that contribute to a larger future time
step, the latter performs predictive computing along the forward direction. Unlike
existing relaxation methods, WavePipe facilitates parallel circuit simulation without jeopardizing convergence and accuracy. As a coarse-grained parallel approach, it requires
low parallel programming effort, furthermore it creates new avenues to have a
full utilization of increasingly parallel hardware by going beyond conventional finer
grained parallel device model evaluation and matrix solutions.
This dissertation also exploits the recently developed explicit telescopic projective
integration method for efficient parallel transient circuit simulation by addressing the
stability limitation of explicit numerical integration. The new method allows the
effective time step controlled by accuracy requirement instead of stability limitation.
Therefore, it not only leads to noticeable efficiency improvement, but also lends itself
to straightforward parallelization due to its explicit nature.
For frequency-domain simulation, this dissertation presents a parallel harmonic
balance approach, applicable to the steady-state and envelope-following analyses of
both driven and autonomous circuits. The new approach is centered on a naturally-parallelizable
preconditioning technique that speeds up the core computation in harmonic
balance based analysis. The proposed method facilitates parallel computing
via the use of domain knowledge and simplifies parallel programming compared with
fine-grained strategies. As a result, favorable runtime speedups are achieved
DPS - Dynamic Parallel Schedules
Dynamic Parallel Schedules (DPS) is a high-level framework for developing parallel applications on distributed memory computers (e.g. clusters of PC). Its model relies on compositional customizable split-compute-merge graphs of operations (directed acyclic flow graphs). The graphs and the mapping of operations to processing nodes are specified dynamically at runtime. DPS applications are pipelined and multithreaded by construction, ensuring a maximal overlap of computations and communications. DPS applications can call parallel services exposed by other DPS applications, enabling the creation of reusable parallel components. The DPS framework relies on a C++ class library. Thanks to its dynamic nature, DPS offers new perspectives for the creation and deployment of parallel applications running on server cluster
Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations
This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered
An Analysis for Evaluating the Cost/Profit Effectiveness of Parallel Systems
A new domain of commercial applications demands the development of inexpensive parallel computing platforms to lower the cost of operations and increase the business profit. The calculation of returns on an IT investment is now important to justify the decision of upgrading or replacing parallel systems. This thesis presents a framework of the performance and economic factors that are considered when evaluating a parallel system. We introduce a metric called the cost/profit effective metric, which measures the effectiveness of a parallel system in terms of performance, cost and profit. This metric describes the profit obtained from the performance of three different domains for scaling: speed-up, throughput and/or scale-up. Cost is measured by the actual costs of a parallel system. We present two cases of study to demonstrate the application of this metric and analyze the results to support the evaluation of the parallel system on each case
Recommended from our members
Towards architecture-adaptable parallel programming
There is a software gap in parallel processing. The short lifespan and small installation base of parallel architectures have made it economically infeasible to develop platform-specific parallel programming environments that deliver performance and programmability. One obvious solution is to build architecture-independent programming environments. But the architecture independence usually comes at the expense of performance, since the most efficient parallel algorithm for solving a problem often depends on the target platform. Thus, unless a parallel programming system has the ability to adapt the algorithm to the architecture, it will not be effectively machine-independent.
This research develops a new methodology for architecture-adaptable parallel programming. The methodology is built on three key ideas: (1) the use of a database of parameterized algorithmic templates to represent computable functions; (2) frame-based representation of processing environments; and (3) the use of an analytical performance prediction tool for automatic algorithm design.
This methodology pursues a problem-oriented approach to parallel processing as opposed to the traditional algorithm-oriented approach. This enables the
development of software environments with a high level of abstraction. The users state the problem to be solved using a high-level notation; they are freed from the esoteric tasks of parallel algorithm design and implementation.
This methodology has been validated in the format of a prototype of a system capable of automatically generating an efficient parallel program when presented with a well-defined problem and the description of a target platform. The use of object technology has made the system easily extensible. The templates are designed using a parallel adaptation of the well-known divide-and-conquer paradigm.
The prototype system has been used to solve several numerical problems efficiently on a wide spectrum of architectures. The target platforms include multicomputers (Thinking Machines CM-5 and Meiko CS-2), networks of workstations (IBM RS/6000s connected by FDDI), multiprocessors (Sequent Symmetry, SGI Power Challenge, and Sun SPARCServer), and a hierarchical system consisting of a cluster of multiprocessors on Myrinet