1,228 research outputs found
Towards a GPU-based implementation of interaction nets
We present ingpu, a GPU-based evaluator for interaction nets that heavily
utilizes their potential for parallel evaluation. We discuss advantages and
challenges of the ongoing implementation of ingpu and compare its performance
to existing interaction nets evaluators.Comment: In Proceedings DCM 2012, arXiv:1403.757
Processor Allocation for Optimistic Parallelization of Irregular Programs
Optimistic parallelization is a promising approach for the parallelization of
irregular algorithms: potentially interfering tasks are launched dynamically,
and the runtime system detects conflicts between concurrent activities,
aborting and rolling back conflicting tasks. However, parallelism in irregular
algorithms is very complex. In a regular algorithm like dense matrix
multiplication, the amount of parallelism can usually be expressed as a
function of the problem size, so it is reasonably straightforward to determine
how many processors should be allocated to execute a regular algorithm of a
certain size (this is called the processor allocation problem). In contrast,
parallelism in irregular algorithms can be a function of input parameters, and
the amount of parallelism can vary dramatically during the execution of the
irregular algorithm. Therefore, the processor allocation problem for irregular
algorithms is very difficult.
In this paper, we describe the first systematic strategy for addressing this
problem. Our approach is based on a construct called the conflict graph, which
(i) provides insight into the amount of parallelism that can be extracted from
an irregular algorithm, and (ii) can be used to address the processor
allocation problem for irregular algorithms. We show that this problem is
related to a generalization of the unfriendly seating problem and, by extending
Tur\'an's theorem, we obtain a worst-case class of problems for optimistic
parallelization, which we use to derive a lower bound on the exploitable
parallelism. Finally, using some theoretically derived properties and some
experimental facts, we design a quick and stable control strategy for solving
the processor allocation problem heuristically.Comment: 12 pages, 3 figures, extended version of SPAA 2011 brief announcemen
A pattern language for parallelizing irregular algorithms
Dissertação apresentada na Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa para obtenção do grau de Mestre em Engenharia InformáticaIn irregular algorithms, data set’s dependences and distributions cannot be statically predicted.
This class of algorithms tends to organize computations in terms of data locality instead of parallelizing control in multiple threads. Thus, opportunities for exploiting parallelism vary dynamically, according to how the algorithm changes data dependences. As such, effective parallelization of such algorithms requires new approaches that account for that dynamic nature.
This dissertation addresses the problem of building efficient parallel implementations of irregular algorithms by proposing to extract, analyze and document patterns of concurrency and parallelism present in the Galois parallelization framework for irregular algorithms.
Patterns capture formal representations of a tangible solution to a problem that arises in a well defined context within a specific domain.
We document the said patterns in a pattern language, i.e., a set of inter-dependent patterns that compose well-documented template solutions that can be reused whenever a certain problem arises in a well-known context
Evaluating multi-core graph algorithm frameworks
Multi-core and GPU-based systems offer unprecedented computational power. They are, however, challenging to utilize effectively, especially when processing irregular data such as graphs. Graphs are of great interest, as they are now used to model geographic-, social- andneural networks. Several interesting programming frameworks for graph processing have therefore been developed these past few years.
In this work, we highlight the strengths and weaknesses of the Galois, GraphBLAST, Gunrock and Ligra graph frameworks through benchmarking their single source shortest path (SSSP) implementations using the SuiteSparse Matrix Collection. Tests were done on an Nvidia DGX2 system, except for Ligra, which only provides a multi-core framework. D-IrGL, built on Galois, also provided a multi-GPU option for SSSP. We also look at program size, documentation and overall ease of use.
High performance generally comes at the price of high complexity. D-IrGL shows its strength on the very largest graphs, where it achieved the best run-time, while Gunrock processed most other large sets the fastest. However, GraphBLAST, with a relatively low-complexity interface, achieves the greatest median throughput across all our test cases. This despite that its SSSP implementation size is only 1/10th of Gunrock, which for our tests has the highest peak throughput and the fastest run-time in most cases. Ligra had less computational resources available, and consequently performed worse in most cases, but it is also a very compact and easy to use framework. Futher analyses and some suggestions for future work are also included
Quantum ESPRESSO: a modular and open-source software project for quantum simulations of materials
Quantum ESPRESSO is an integrated suite of computer codes for
electronic-structure calculations and materials modeling, based on
density-functional theory, plane waves, and pseudopotentials (norm-conserving,
ultrasoft, and projector-augmented wave). Quantum ESPRESSO stands for "opEn
Source Package for Research in Electronic Structure, Simulation, and
Optimization". It is freely available to researchers around the world under the
terms of the GNU General Public License. Quantum ESPRESSO builds upon
newly-restructured electronic-structure codes that have been developed and
tested by some of the original authors of novel electronic-structure algorithms
and applied in the last twenty years by some of the leading materials modeling
groups worldwide. Innovation and efficiency are still its main focus, with
special attention paid to massively-parallel architectures, and a great effort
being devoted to user friendliness. Quantum ESPRESSO is evolving towards a
distribution of independent and inter-operable codes in the spirit of an
open-source project, where researchers active in the field of
electronic-structure calculations are encouraged to participate in the project
by contributing their own codes or by implementing their own ideas into
existing codes.Comment: 36 pages, 5 figures, resubmitted to J.Phys.: Condens. Matte
Amorphous slicing of extended finite state machines
Slicing is useful for many Software Engineering applications and has been widely studied for three decades, but there has been comparatively little work on slicing Extended Finite State Machines (EFSMs). This paper introduces a set of dependency based EFSM slicing algorithms and an accompanying tool. We demonstrate that our algorithms are suitable for dependence based slicing. We use our tool to conduct experiments on ten EFSMs, including benchmarks and industrial EFSMs. Ours is the first empirical study of dependence based program slicing for EFSMs. Compared to the only previously published dependence based algorithm, our average slice is smaller 40% of the time and larger only 10% of the time, with an average slice size of 35% for termination insensitive slicing
Automatic skeleton-driven performance optimizations for transactional memory
The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more
coarse -grain parallelism with every new generation of processors, or the performance
of their applications will remain roughly the same or even degrade. Unfortunately,
parallel programming is still hard and error prone. This has driven the development of
many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance
and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or
pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the
problem at hand. Thus, this programming approach simplifies parallel programming
by eliminating some of the major challenges of parallel programming, namely thread
communication, scheduling and orchestration. However, the application programmer
has still to correctly synchronize threads on data races. This commonly requires the
use of locks to guarantee atomic access to shared data. In particular, lock programming
is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads
that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly.
This model allows programmers to write parallel code as transactions, which are then
guaranteed by the runtime system to execute atomically and in isolation regardless of
eventual data races. TM programming thus frees the application from deadlocks and
enables the exploitation of coarse grain parallelism when transactions do not conflict
very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework.
This fact makes the combination of skeleton -based and transactional programming a
natural step to improve programmability since these models complement each other.
In fact, this combination releases the application programmer from dealing with thread
management and data races, and also inherits the performance improvements of both
models. In addition to it, a skeleton framework is also amenable to skeleton - driven
iii
performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the
knowledge about the worklist pattern and the TM nature of the applications to avoid
transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly.
Experimental results on a subset of five applications from the STAMP benchmark suite
show that the proposed autotuning mechanism can achieve performance improvements
within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform
Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance
optimizations such as thread mapping and memory page allocation. In order to do
it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five
applications from the STAMP benchmark show that the OpenSkel framework with the
extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %,
over a baseline version for a 16 -core UMA platform and up to 162 %, with an average
of 91 %, for a 32 -core NUMA platform
- …