440 research outputs found
Implementation and Evaluation of Algorithmic Skeletons: Parallelisation of Computer Algebra Algorithms
This thesis presents design and implementation approaches for the parallel algorithms of computer algebra. We use algorithmic skeletons and also further approaches, like data parallel arithmetic and actors. We have implemented skeletons for divide and conquer algorithms and some special parallel loops, that we call ‘repeated computation with a possibility of premature termination’. We introduce in this thesis a rational data parallel arithmetic. We focus on parallel symbolic computation algorithms, for these algorithms our arithmetic provides a generic parallelisation approach.
The implementation is carried out in Eden, a parallel functional programming language based on Haskell. This choice enables us to encode both the skeletons and the programs in the same language. Moreover, it allows us to refrain from using two different languages—one for the implementation and one for the interface—for our implementation of computer algebra algorithms.
Further, this thesis presents methods for evaluation and estimation of parallel execution times. We partition the parallel execution time into two components. One of them accounts for the quality of the parallelisation, we call it the ‘parallel penalty’. The other is the sequential execution time. For the estimation, we predict both components separately, using statistical methods. This enables very confident estimations, although using drastically less measurement points than other methods. We have applied both our evaluation and estimation approaches to the parallel programs presented in this thesis. We haven also used existing estimation methods.
We developed divide and conquer skeletons for the implementation of fast parallel multiplication. We have implemented the Karatsuba algorithm, Strassen’s matrix multiplication algorithm and the fast Fourier transform. The latter was used to implement polynomial convolution that leads to a further fast multiplication algorithm. Specially for our implementation of Strassen algorithm we have designed and implemented a divide and conquer skeleton basing on actors. We have implemented the parallel fast Fourier transform, and not only did we use new divide and conquer skeletons, but also developed a map-and-transpose skeleton. It enables good parallelisation of the Fourier transform. The parallelisation of Karatsuba multiplication shows a very good performance. We have analysed the parallel penalty of our programs and compared it to the serial fraction—an approach, known from literature. We also performed execution time estimations of our divide and conquer programs.
This thesis presents a parallel map+reduce skeleton scheme. It allows us to combine the usual parallel map skeletons, like parMap, farm, workpool, with a premature termination property. We use this to implement the so-called ‘parallel repeated computation’, a special form of a speculative parallel loop. We have implemented two probabilistic primality tests: the Rabin–Miller test and the Jacobi sum test. We parallelised both with our approach. We analysed the task distribution and stated the fitting configurations of the Jacobi sum test. We have shown formally that the Jacobi sum test can be implemented in parallel. Subsequently, we parallelised it, analysed the load balancing issues, and produced an optimisation. The latter enabled a good implementation, as verified using the parallel penalty. We have also estimated the performance of the tests for further input sizes and numbers of processing elements. Parallelisation of the Jacobi sum test and our generic parallelisation scheme for the repeated computation is our original contribution.
The data parallel arithmetic was defined not only for integers, which is already known, but also for rationals. We handled the common factors of the numerator or denominator of the fraction with the modulus in a novel manner. This is required to obtain a true multiple-residue arithmetic, a novel result of our research. Using these mathematical advances, we have parallelised the determinant computation using the Gauß elimination. As always, we have performed task distribution analysis and estimation of the parallel execution time of our implementation. A similar computation in Maple emphasised the potential of our approach. Data parallel arithmetic enables parallelisation of entire classes of computer algebra algorithms.
Summarising, this thesis presents and thoroughly evaluates new and existing design decisions for high-level parallelisations of computer algebra algorithms
Towards Generic Scalable Parallel Combinatorial Search
Combinatorial search problems in mathematics, e.g. in finite geometry, are notoriously hard; a state-of-the-art backtracking search algorithm can easily take months to solve a single problem. There is clearly demand for parallel combinatorial search algorithms scaling to hundreds of cores and beyond. However, backtracking combinatorial searches are challenging to parallelise due to their sensitivity to search order and due to the their irregularly shaped search trees. Moreover, scaling parallel search to hundreds of cores generally requires highly specialist parallel programming expertise.
This paper proposes a generic scalable framework for solving hard combinatorial problems. Key elements are distributed memory task parallelism (to achieve scale), work stealing (to cope with irregularity), and generic algorithmic skeletons for combinatorial search (to reduce the parallelism expertise required). We outline two implementations: a mature Haskell Tree Search Library (HTSL) based around algorithmic skeletons and a prototype C++ Tree Search Library (CTSL) that uses hand coded applications.
Experiments on maximum clique problems and on a problem in finite geometry, the search for spreads in H(4,2^2), show that (1) CTSL consistently outperforms HTSL on sequential runs, and (2) both libraries scale to 200 cores, e.g. speeding up spreads search by a factor of 81 (HTSL) and 60 (CTSL), respectively. This demonstrates the potential of our generic framework for scaling parallel combinatorial search to large distributed memory platforms
Garbage Collection for Java Distributed Objects
We present a distributed garbage collection algorithm for Java distributed objects using the object model provided by the Java Support for Distributed Objects (JSDA) object model and using weak references in Java. The algorithm can also be used for any other Java based distributed object models that use the stub-skeleton paradigm. Furthermore, the solution could also be applied to any language that supports weak references as a mean of interaction with the local garbage collector. We also give a formal definition and a proof of correctness for the proposed algorithm
The Paragraph: Design and Implementation of the STAPL Parallel Task Graph
Parallel programming is becoming mainstream due to the increased availability
of multiprocessor and multicore architectures and the need to solve larger and
more complex problems. Languages and tools available for the development of
parallel applications are often difficult to learn and use. The Standard Template
Adaptive Parallel Library (STAPL) is being developed to help programmers
address these difficulties.
STAPL is a parallel C++ library with functionality similar to STL, the ISO
adopted C++ Standard Template Library. STAPL provides
a collection of parallel pContainers for data storage and pViews that
provide uniform data access operations by abstracting away the details of
the pContainer data distribution. Generic pAlgorithms are written in terms of PARAGRAPHs,
high level task graphs expressed as a composition of common parallel patterns.
These task graphs define a set of operations on pViews as well as any
ordering (i.e., dependences) on these operations that must be enforced by
STAPL for a valid execution. The subject of this dissertation is the PARAGRAPH Executor,
a framework that manages the runtime instantiation and execution of STAPL
PARAGRAPHS.
We address several challenges present when using a task graph program representation
and discuss a novel approach to dependence specification which allows task graph creation
and execution to proceed concurrently. This overlapping increases scalability and
reduces the resources required by the PARAGRAPH Executor. We also describe the interface for task
specification as well as optimizations that address issues such as data locality.
We evaluate the performance of the PARAGRAPH Executor on several parallel machines including
massively parallel Cray XT4 and Cray XE6 systems and an IBM Power5 cluster.
Using tests including generic parallel algorithms, kernels from the NAS NPB suite,
and a nuclear particle transport application written in STAPL, we demonstrate that the
PARAGRAPH Executor enables STAPL to exhibit good scalability on more than processors
Multi-GPU support on the marrow algorithmic skeleton framework
Dissertação para obtenção do Grau de Mestre em
Engenharia InformáticaWith the proliferation of general purpose GPUs, workload parallelization and datatransfer optimization became an increasing concern. The natural evolution from using a single GPU, is multiplying the amount of available processors, presenting new challenges, as tuning the workload decompositions and load balancing, when dealing with heterogeneous systems.
Higher-level programming is a very important asset in a multi-GPU environment, due to the complexity inherent to the currently used GPGPU APIs (OpenCL and CUDA), because of their low-level and code overhead. This can be obtained by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations
such as transparent load balancing mechanism and reduced explicit code overhead.
Algorithmic Skeletons, previously used in cluster environments, have recently been
adapted to the GPGPU context. Skeletons abstract most sources of code overhead, by
defining computation patterns of commonly used algorithms. The Marrow algorithmic
skeleton library is one of these, taking advantage of the abstractions to automate the
orchestration needed for an efficient GPU execution.
This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons
in the modular and efficient programming of multiple heterogeneous GPUs, within a single machine.
We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single-GPU.projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200
Automatic skeleton-driven performance optimizations for transactional memory
The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more
coarse -grain parallelism with every new generation of processors, or the performance
of their applications will remain roughly the same or even degrade. Unfortunately,
parallel programming is still hard and error prone. This has driven the development of
many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance
and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or
pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the
problem at hand. Thus, this programming approach simplifies parallel programming
by eliminating some of the major challenges of parallel programming, namely thread
communication, scheduling and orchestration. However, the application programmer
has still to correctly synchronize threads on data races. This commonly requires the
use of locks to guarantee atomic access to shared data. In particular, lock programming
is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads
that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly.
This model allows programmers to write parallel code as transactions, which are then
guaranteed by the runtime system to execute atomically and in isolation regardless of
eventual data races. TM programming thus frees the application from deadlocks and
enables the exploitation of coarse grain parallelism when transactions do not conflict
very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework.
This fact makes the combination of skeleton -based and transactional programming a
natural step to improve programmability since these models complement each other.
In fact, this combination releases the application programmer from dealing with thread
management and data races, and also inherits the performance improvements of both
models. In addition to it, a skeleton framework is also amenable to skeleton - driven
iii
performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the
knowledge about the worklist pattern and the TM nature of the applications to avoid
transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly.
Experimental results on a subset of five applications from the STAMP benchmark suite
show that the proposed autotuning mechanism can achieve performance improvements
within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform
Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance
optimizations such as thread mapping and memory page allocation. In order to do
it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five
applications from the STAMP benchmark show that the OpenSkel framework with the
extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %,
over a baseline version for a 16 -core UMA platform and up to 162 %, with an average
of 91 %, for a 32 -core NUMA platform
VLSI Design
This book provides some recent advances in design nanometer VLSI chips. The selected topics try to present some open problems and challenges with important topics ranging from design tools, new post-silicon devices, GPU-based parallel computing, emerging 3D integration, and antenna design. The book consists of two parts, with chapters such as: VLSI design for multi-sensor smart systems on a chip, Three-dimensional integrated circuits design for thousand-core processors, Parallel symbolic analysis of large analog circuits on GPU platforms, Algorithms for CAD tools VLSI design, A multilevel memetic algorithm for large SAT-encoded problems, etc
Parallel source code transformation techniques using design patterns
Mención Internacional en el tÃtulo de doctorIn recent years, the traditional approaches for improving performance, such as increasing
the clock frequency, has come to a dead-end. To tackle this issue, parallel architectures,
such as multi-/many-core processors, have been envisioned to increase
the performance by providing greater processing capabilities. However, programming
efficiently for this architectures demands big efforts in order to transform sequential
applications into parallel and to optimize such applications. Compared to
sequential programming, designing and implementing parallel applications for operating
on modern hardware poses a number of new challenges to developers such
as data races, deadlocks, load imbalance, etc.
To pave the way, parallel design patterns provide a way to encapsulate algorithmic
aspects, allowing users to implement robust, readable and portable solutions
with such high-level abstractions. Basically, these patterns instantiate parallelism
while hiding away the complexity of concurrency mechanisms, such as thread management,
synchronizations or data sharing. Nonetheless, frameworks following this
philosophy does not share the same interface and users require understanding different
libraries, and their capabilities, not only to decide which fits best for their
purposes but also to properly leverage them. Furthermore, in order to parallelize
these applications, it is necessary to analyze the sequential code in order to detect the
regions of code that can be parallelized that is a time consuming and complex task.
Additionally, different libraries targeted to specific devices provide some algorithms
implementations that are already parallel and highly-tuned. In these situations, it is
also necessary to analyze and determine which routine implementation is the most
suitable for a given problem.
To tackle these issues, this thesis aims at simplifying and minimizing the necessary
efforts to transform sequential applications into parallel. This way, resulting
codes will improve their performance by fully exploiting the available resources
while the development efforts will be considerably reduced. Basically, in this thesis,
we contribute with the following. First, we propose a technique to detect potential
parallel patterns in sequential code. Second, we provide a novel generic C++ interface
for parallel patterns which acts as a switch among existing frameworks. Third,
we implement a framework that is able to transform sequential code into parallel
using the proposed pattern discovery technique and pattern interface. Finally, we
propose mechanisms that are able to select the most suitable device and routine implementation
to solve a given problem based on previous performance information.
The evaluation demonstrates that using the proposed techniques can minimize the
refactoring and optimization time while improving the performance of the resulting
applications with respect to the original code.En los últimos años, las técnicas tradicionales para mejorar el rendimiento, como es
el caso del incremento de la frecuencia de reloj, han llegado a sus lÃmites. Con el
fin de seguir mejorando el rendimiento, se han desarrollado las arquitecturas paralelas,
las cuales proporcionan un incremento del rendimiento al estar provistas de
mayores capacidades de procesamiento. Sin embargo, programar de forma eficiente
para estas arquitecturas requieren de grandes esfuerzos por parte de los desarrolladores.
Comparado con la programación secuencial, diseñar e implementar aplicaciones
paralelas enfocadas a trabajar en estas arquitecturas presentan una gran
cantidad de dificultades como son las condiciones de carrera, los deadlocks o el incorrecto
balanceo de la carga.
En este sentido, los patrones paralelos son una forma de encapsular aspectos
algorÃtmicos de las aplicaciones permitiendo el desarrollo de soluciones robustas,
portables y legibles gracias a las abstracciones de alto nivel. En general, estos patrones
son capaces de proporcionar el paralelismo a la vez que ocultan las complejidades
derivadas de los mecanismos de control de concurrencia necesarios como el
manejo de los hilos, las sincronizaciones o la compartición de datos. No obstante,
los diferentes frameworks que siguen esta filosofÃa no comparten una única interfaz
lo que conlleva que los usuarios deban conocer múltiples bibliotecas y sus capacidades,
con el fin de decidir cuál de ellos es mejor para una situación concreta y
como usarlos de forma eficiente. Además, con el fin de paralelizar aplicaciones existentes,
es necesario analizar e identificar las regiones del código que pueden ser paralelizadas,
lo cual es una tarea ardua y compleja. Además, algunos algoritmos ya se
encuentran implementados en paralelo y optimizados para arquitecturas concretas
en diversas bibliotecas. Esto da lugar a que sea necesario analizar y determinar que
implementación concreta es la más adecuada para solucionar un problema dado.
Para paliar estas situaciones, está tesis busca simplificar y minimizar el esfuerzo
necesario para transformar aplicaciones secuenciales en paralelas. De esta forma,
los códigos resultantes serán capaces de explotar los recursos disponibles a la vez
que se reduce considerablemente el esfuerzo de desarrollo necesario. En general,
esta tesis contribuye con lo siguiente. En primer lugar, se propone una técnica de
detección de patrones paralelos en códigos secuenciales. En segundo lugar, se presenta
una interfaz genérica de patrones paralelos para C++ que permite seleccionar
la implementación de dichos patrones proporcionada por frameworks ya existentes.
En tercer lugar, se introduce un framework de transformación de código secuencial
a paralelo que hace uso de las técnicas de detección de patrones y la interfaz
presentadas. Finalmente, se proponen mecanismos capaces de seleccionar la implementación
más adecuada para solucionar un problema concreto basándose en el
rendimiento obtenido en ejecuciones previas. Gracias a la evaluación realizada se ha
podido demostrar que uso de las técnicas presentadas pueden minimizar el tiempo
necesario para transformar y optimizar el código a la vez que mejora el rendimiento
de las aplicaciones transformadas.Programa Oficial de Doctorado en Ciencia y TecnologÃa InformáticaPresidente: David Expósito Singh.- Secretario: Rafael Asenjo Plaza.- Vocal: Marco Aldinucc
Genetic algorithm based schedulers for grid computing systems
In this paper we present Genetic Algorithms (GAs) based schedulers for efficiently allocating jobs to resources in a Grid system. Scheduling is a key problem in emergent computational systems, such as Grid and P2P, in order to benefit from the large computing capacity of such systems. We present an extensive study on the usefulness of GAs for designing efficient Grid schedulers when makespan and flowtime are minimized. Two encoding schemes have been considered and most of GA operators for each of them are implemented and empirically studied. The extensive experimental study showed that our GA-based schedulers outperform existing GA implementations in the literature for the problem and also revealed their efficiency when makespan and flowtime are minimized either in a hierarchical or a simultaneous optimization mode; previous approaches considered only the minimization of the makespan. Moreover, we were able to identify which GAs versions work best under certain Grid characteristics, which is very useful for real Grids. Our GA-based schedulers are very fast and hence they can be used to dynamically schedule jobs arriving in the Grid system by running in batch mode for a short time.Peer ReviewedPostprint (author's final draft
- …