146 research outputs found
The exploitation of parallelism on shared memory multiprocessors
PhD Thesis
With the arrival of many general purpose shared memory multiple processor
(multiprocessor) computers into the commercial arena during the mid-1980s, a
rift has opened between the raw processing power offered by the emerging
hardware and the relative inability of its operating software to effectively deliver
this power to potential users. This rift stems from the fact that, currently, no
computational model with the capability to elegantly express parallel activity is
mature enough to be universally accepted, and used as the basis for programming
languages to exploit the parallelism that multiprocessors offer. To add to this,
there is a lack of software tools to assist programmers in the processes of designing
and debugging parallel programs.
Although much research has been done in the field of programming languages,
no undisputed candidate for the most appropriate language for programming
shared memory multiprocessors has yet been found. This thesis examines why this
state of affairs has arisen and proposes programming language constructs,
together with a programming methodology and environment, to close the
ever-widening gap between hardware and software.
The novel programming constructs described in this thesis are intended for use
in imperative languages even though they make use of the synchronisation
inherent in the dataflow model by using the semantics of single assignment when
operating on shared data, so giving rise to the term shared values. As there are
several distinct parallel programming paradigms, matching flavours of shared
value are developed to permit the concise expression of these paradigms.
The Science and Engineering Research Council
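To make the idea of single-assignment semantics on shared data more concrete, the following is a minimal sketch of a write-once cell whose readers block until the value exists, which is the dataflow-style synchronisation the abstract alludes to. It is not the construct defined in the thesis: the class name `SharedValue`, its methods `assign` and `read`, and the `demo` function are hypothetical names chosen for illustration only.

```python
import threading


class SharedValue:
    """Hypothetical write-once cell: readers block until the single write occurs."""

    def __init__(self):
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._value = None

    def assign(self, value):
        # Single assignment: a second write is an error, not an update.
        with self._lock:
            if self._ready.is_set():
                raise RuntimeError("shared value already assigned")
            self._value = value
            self._ready.set()

    def read(self, timeout=None):
        # Reading synchronises with the writer; no explicit lock is needed by callers.
        if not self._ready.wait(timeout):
            raise TimeoutError("shared value not assigned in time")
        return self._value


def demo():
    a, b, total = SharedValue(), SharedValue(), SharedValue()

    def consumer():
        # Blocks until both inputs have been assigned.
        total.assign(a.read() + b.read())

    worker = threading.Thread(target=consumer)
    worker.start()  # the consumer may start before its inputs exist
    a.assign(2)
    b.assign(40)
    worker.join()
    return total.read()
```

Because each cell is written at most once, the result of `demo()` does not depend on the order in which the threads happen to run.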
Constraint programming on a heterogeneous multicore architecture
Constraint programming libraries are useful when building applications developed mostly in mainstream programming languages: they do not require the developers to acquire skills for a new language, providing instead declarative programming tools for use within conventional systems. Some approaches to constraint programming favour completeness, such as propagation-based systems. Others are more interested in getting to a good solution fast, regardless of whether all solutions may be found; this approach is used in local search systems. Designing hybrid approaches (propagation + local search) seems promising since the advantages of both may be combined into a single approach.
Parallel architectures are becoming more commonplace, partly due to the large-scale availability of individual systems but also because of the trend towards generalizing the use of multicore microprocessors, that is, processors with several processing units. In this thesis an architecture for mixed constraint solvers is proposed, relying both on propagation and local search, which is designed to function effectively in a heterogeneous multicore architecture.
The Nornir run-time system for parallel programs using Kahn process networks on multi-core machines – A flexible alternative to MapReduce
Even though shared-memory concurrency is a paradigm frequently used for developing parallel applications on small- and middle-sized machines, experience has shown that it is hard to use. This is largely caused by synchronization primitives which are low-level, inherently non-deterministic, and, consequently, non-intuitive to use. In this paper, we present the Nornir run-time system. Nornir is comparable to well-known frameworks such as MapReduce and Dryad that are recognized for their efficiency and simplicity. Unlike these frameworks, Nornir also supports process structures containing branches and cycles. Nornir is based on the formalism of Kahn process networks, which is a shared-nothing, message-passing model of concurrency. We deem this model a simple and deterministic alternative to shared-memory concurrency. Experiments with real and synthetic benchmarks on up to 8 CPUs show that performance in most cases scales almost linearly with the number of CPUs, when not limited by data dependencies. We also show that the modeling flexibility allows Nornir to outperform its MapReduce counterparts using well-known benchmarks.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited
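The Kahn process network model mentioned in the Nornir abstract can be illustrated with a small sketch: processes share nothing, communicate only over unbounded FIFO channels, and block when reading from an empty channel, which is what makes the output deterministic. The code below does not use Nornir or its API; the process functions, the `DONE` end-of-stream marker, and the network shape (one source branching into two transforms that rejoin) are assumptions made for this example.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def source(items, out_a, out_b):
    # Branch: the same stream is sent down two channels.
    for x in items:
        out_a.put(x)
        out_b.put(x)
    out_a.put(DONE)
    out_b.put(DONE)


def transform(fn, inp, out):
    while True:
        x = inp.get()  # blocking read, as the Kahn model requires
        if x is DONE:
            out.put(DONE)
            return
        out.put(fn(x))


def merge_sum(in_a, in_b, results):
    # Reads one token from each input in a fixed order, then emits their sum.
    while True:
        a = in_a.get()
        b = in_b.get()
        if a is DONE or b is DONE:
            return
        results.append(a + b)


def run_network(items):
    c1, c2, c3, c4 = (queue.Queue() for _ in range(4))  # unbounded FIFO channels
    results = []
    procs = [
        threading.Thread(target=source, args=(items, c1, c2)),
        threading.Thread(target=transform, args=(lambda x: 2 * x, c1, c3)),
        threading.Thread(target=transform, args=(lambda x: x * x, c2, c4)),
        threading.Thread(target=merge_sum, args=(c3, c4, results)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results
```

Whatever the thread scheduling, `run_network([1, 2, 3])` yields the same list, since no process can observe whether a channel is empty without blocking on it.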
Massively parallel declarative computational models
Current computer architectures are parallel, with an increasing number of processors. Parallel programming is an error-prone task, and declarative models such as those based on constraints relieve the programmer of some of its difficult aspects because they abstract control away. In this work we study and develop techniques for declarative computational models based on constraints using GPI, a recent programming tool and model, aiming at large-scale parallel execution. The main contributions of this work are: a GPI implementation of a scalable dynamic load balancing scheme based on work stealing, suitable for tree-shaped computations and effective for systems with thousands of threads; a parallel constraint solver, MaCS, implemented to take advantage of the GPI programming model, whose experimental evaluation shows very good scalability on systems with hundreds of cores; and a GPI parallel version of the Adaptive Search algorithm, including different variants. The study on different problems advances the understanding of scalability issues known to exist with large numbers of cores.
Adaptive architecture-transparent policy control in a distributed graph reducer
The end of the frequency scaling era occurred around 2005, when the clock frequency
stalled for commodity architectures. Thus performance improvements that could
in the past be expected with each new hardware generation needed to originate
elsewhere. Almost all computer architectures exhibit substantial and growing levels
of parallelism, exploiting which became one of the key sources of performance and
scalability improvements. Alas, parallel programming proved much more difficult
than sequential, due to the need to specify coordination and parallelism management
aspects. Whilst low-level languages place the burden on the programmers reducing
productivity and portability, semi-implicit approaches delegate the responsibility to
sophisticated compilers and run-time systems.
This thesis presents a study of adaptive load distribution based on work stealing
using history and ancestry information in a distributed graph reducer for a non-strict functional language. The results contribute to the exploration of more flexible
run-time-system-level parallelism control implementing a semi-explicit model of parallelism, which offers productivity and a high level of abstraction by delegating the
responsibility of coordination to the run-time system.
After characterising a set of parallel functional applications, we study the use of
historical information to adapt the choice of the victim to steal from in a work stealing scheduler. We observe substantially lower numbers of messages for data-parallel
and nested applications. However, this heuristic fails in cases where past application behaviour does not resemble future behaviour, for instance for Divide-&-Conquer
applications with a large number of very fine-grained threads and generators of parallelism that move dynamically across processing elements. This mechanism is not
specific to the language and the run-time system, and applies to other work stealing
schedulers.
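As a rough illustration of using historical information to pick a victim, the sketch below keeps per-victim counts of steal attempts and successes and prefers the victim with the best smoothed success rate. This is not the heuristic evaluated in the thesis, whose details the abstract does not give; the class `VictimSelector`, its methods, and the smoothing rule are assumptions introduced for this example.

```python
import random


class VictimSelector:
    """Hypothetical history-based victim choice for one processing element (PE)."""

    def __init__(self, my_id, num_pes, rng=None):
        self.candidates = [pe for pe in range(num_pes) if pe != my_id]
        self.successes = {pe: 0 for pe in self.candidates}
        self.attempts = {pe: 0 for pe in self.candidates}
        self.rng = rng or random.Random()

    def record(self, victim, got_work):
        # Called after each steal attempt within the same run.
        self.attempts[victim] += 1
        if got_work:
            self.successes[victim] += 1

    def score(self, victim):
        # Smoothed success rate: victims never tried score 0.5,
        # so they are neither favoured nor ruled out.
        return (self.successes[victim] + 1) / (self.attempts[victim] + 2)

    def choose(self):
        best = max(self.score(pe) for pe in self.candidates)
        top = [pe for pe in self.candidates if self.score(pe) == best]
        return self.rng.choice(top)  # break ties randomly
```

With no history every candidate ties and the choice degenerates to ordinary random stealing, which is the usual baseline for work stealing schedulers.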
Next, we focus on the other key work stealing decision, namely which sparks (units of potential parallelism) to donate, investigating the effect of Spark Colocation
on the performance of five Divide-&-Conquer programs run on a cluster of up to
256 PEs. When using Spark Colocation, the distributed graph reducer shares related
work resulting in a higher degree of both potential and actual parallelism, and more
fine-grained and less variable thread size. We validate this behaviour by observing
a reduction in average fetch times, but increased amounts of FETCH messages and
of inter-PE pointers for colocation, which nevertheless results in improved load balance for three of the five benchmark programs. The results show high speedups and
speedup improvements for Spark Colocation for the three more regular and nested
applications and performance degradation for two programs: one that is excessively
fine-grained and one exhibiting limited scalability. Overall, Spark Colocation appears most beneficial for higher numbers of PEs, where improved load balance and
higher degree of parallelism have more opportunities to pay off.
In more general terms, we show that a run-time system can beneficially use historical information on past stealing successes that is gathered dynamically and used
within the same run and the ancestry information dynamically reconstructed at run
time using annotations. Moreover, the results support the view that different heuristics are beneficial for applications using different parallelism patterns, underlining
the advantages of a flexible architecture-transparent approach.
The Scottish Informatics and Computer Science Alliance (SICSA)
Run-time support for parallel object-oriented computing: the NIP lazy task creation technique and the NIP object-based software distributed shared memory
PhD Thesis
Advances in hardware technologies combined with decreased costs
have started a trend towards massively parallel architectures that utilise
commodity components. It is thought unreasonable to expect software
developers to manage the high degree of parallelism that is made
available by these architectures. This thesis argues that a new
programming model is essential for the development of parallel
applications and presents a model which embraces the notions of
object-orientation and implicit identification of parallelism. The new
model allows software engineers to concentrate on development issues,
using the object-oriented paradigm, whilst being freed from the burden
of explicitly managing parallel activity.
To support the programming model, the semantics of an execution
model are defined and implemented as part of a run-time support
system for object-oriented parallel applications. Details of the novel
techniques from the run-time system, in the areas of lazy task creation
and object-based, distributed shared memory, are presented.
The tasklet construct for representing potentially parallel
computation is introduced and further developed by this thesis. Three
caching techniques that take advantage of memory access patterns
exhibited in object-oriented applications are explored. Finally, the
performance characteristics of the introduced run-time techniques are
analysed through a number of benchmark applications.
Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++
Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++ which is based on a concurrent aggregate parallel model
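The data distribution that HPF asks the user to describe can be pictured as a mapping from array indices to processors. The sketch below computes the owner of each element under the two standard HPF distribution formats, BLOCK and CYCLIC. It is a Python illustration rather than HPF code, it uses 0-based indices (Fortran arrays are 1-based by default), and the function names are hypothetical.

```python
def block_owner(i, n, p):
    """Owner of element i when n elements are split into contiguous blocks over p processors."""
    block = -(-n // p)  # ceil(n / p): the block size used by BLOCK distribution
    return i // block


def cyclic_owner(i, p):
    """Owner of element i when elements are dealt out round-robin over p processors."""
    return i % p


def layout(n, p, kind):
    # Returns the owning processor of every element, for inspection.
    if kind == "block":
        return [block_owner(i, n, p) for i in range(n)]
    if kind == "cyclic":
        return [cyclic_owner(i, p) for i in range(n)]
    raise ValueError("kind must be 'block' or 'cyclic'")
```

For ten elements on four processors, `layout(10, 4, "block")` leaves the last processor with a single element, while `layout(10, 4, "cyclic")` spreads elements more evenly at the cost of locality between neighbours.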
Parallel functional programming for message-passing multiprocessors
We propose a framework for the evaluation of implicitly parallel functional programs on message passing multiprocessors with special emphasis on the issue of load bounding. The model is based on a new encoding of the lambda-calculus in Milner's pi-calculus and combines lazy evaluation and eager (parallel) evaluation in the same framework. The pi-calculus encoding serves as the specification of a more concrete compilation scheme mapping a simple functional language into a message passing, parallel program. We show how and under which conditions we can guarantee successful load bounding based on this compilation scheme. Finally we discuss the architectural requirements for a machine to support our model efficiently and we present a simple RISC-style processor architecture which meets those criteria
- …