37 research outputs found
A Theoretical Approach Involving Recurrence Resolution, Dependence Cycle Statement Ordering and Subroutine Transformation for the Exploitation of Parallelism in Sequential Code.
To exploit parallelism in Fortran code, this dissertation consists of a study of the following three issues: (1) recurrence resolution in Do-loops for vector processing, (2) dependence cycle statement ordering in Do-loops for parallel processing, and (3) sub-routine parallelization. For recurrence resolution, the major findings include: (1) the node splitting algorithm cannot be used directly to break an essential antidependence link, of which the source variable that results in antidependence is itself the sink variable of another true dependence so a correction method is proposed, (2) a sink variable renaming technique is capable of breaking an antidependence and/or output-dependence link, (3) for recurrences formed by only true dependences, a dynamic dependence concept and the derived technique are powerful, and (4) by integrating related techniques, an algorithm for resolving a general multistatement recurrence is developed. The performance of a parallel loop is determined by the level of parallelism and the time delay due to interprocessor communication and synchronization. For a dependence cycle of a single parallel loop executed in a general synchronization mode, the parallelism exposed varies with the alignment of statements. Statements are reordered on the basis of execution-time of the loop as estimated at compile-time. An improved timing formula and a derived statement ordering algorithm are proposed. Further extension of this algorithm to multiple perfectly nested Do-loops with simple global dependence cycle is also presented. The subroutine is a potential source for parallel processing. Several problems must be solved for subroutine parallelization: (1) the precedence of parallel executions of subroutines, (2) identification of the optimum execution mode for each subroutine and (3) the restructuring of a serial program. A five-step approach to parallelize called subroutines for a calling subroutine is proposed: (1) computation of control dependence, (2) approximation of the global effects of subroutines, (3) analysis of data dependence, (4) identification of execution mode, and (5) restructuring of calling and called subroutines. Application of these five steps in a recursive manner to different levels of calling subroutines in a program addresses the parallelization of subroutines
A Survey on Thread-Level Speculation Techniques
Producción CientíficaThread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior, compile-time-dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field.MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)
Recommended from our members
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing.
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25 x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000.Engineering and Applied Science
Exploiting tightly-coupled cores
This is the published manuscript. It was first published by Springer in the Journal of Signal Processing Systems here: http://link.springer.com/article/10.1007%2Fs11265-014-0944-6.The individual processors of a chip-multiprocessor
traditionally have rigid boundaries. Inter-core communication is
only possible via memory and control over a core’s resources is
localised. Specialisation necessary to meet today’s challenging
energy targets is typically provided through the provision of
a range of processor types and accelerators. An alternative
approach is to permit specialisation by tailoring the way a large
number of homogeneous cores are used. The approach here
is to relax processor boundaries, create a richer mix of intercore
communication mechanisms and provide finer-grain control
over, and access to, the resources of each core. We evaluate one
such design, called Loki, that aims to support specialisation in
software on a homogeneous many-core architecture. We focus
on the design of a single 8-core tile, conceived as the building
block for a larger many-core system. We explore the tile’s ability
to support a range of parallelisation opportunities and detail
the control and communication mechanisms needed to exploit
each core’s resources in a flexible manner. Performance and a
detailed breakdown of energy usage is provided for a range of
benchmarks and configurations.This work was supported by EPSRC grant EP/G033110/1
An approach to task-based parallel programming for undergraduate students
This paper presents the description of a compulsory parallel programming course in the bachelor degree in Informatics Engineering at the Barcelona School of Informatics, Universitat Politècnica de Catalunya UPC-BarcelonaTech. The main focus of the course is on the shared-memory programming paradigm, which facilitates the presentation of fundamental aspects and notions of parallel computing. Unlike the “traditional” loop-based approach, which is the focus of parallel programming courses in other universities, this course presents the parallel programming concepts using a task-based approach. Tasking allows students to explore a broader set of parallel decomposition strategies, including linear, iterative and recursive strategies, and their implementation using the current version of OpenMP (OpenMP 4.5), which offers mechanisms (pragmas and intrinsic functions) to easily map these strategies into parallel programs. Simple models to understand the benefits of a task decomposition and the trade-offs introduced by different kinds of overheads are included in the course, together with the use of tools that allow an easy exploration of different task decomposition strategies and their potential parallelism (Tareador) and instrumentation and analysis of task parallel executions on real machines (Extrae and Paraver).This work has been supported by the grant SEV-2015-0493 of the Severo Ochoa Program, awarded by the Spanish Gov-
ernment, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P) and by Generalitat de Catalunya (contracts 2014-MOOC-00057 and 2014-SGR-1051). We also thank the anonymous reviewers and editor for their comments during the review process, other professors that have been in-
volved in the implementation of the course and Paul Carpenter at BSC for his corrections and suggestions to improve the text.Postprint (published version
Discovery of Potential Parallelism in Sequential Programs
In the era of multicore processors, the responsibility for performance gains has been shifted onto software developers. Once improvements of the sequential algorithm have been exhausted, software-managed parallelism is the only option left. However, writing parallel code is still difficult, especially when parallelizing sequential code written by someone else. A key task in this process is the identification of suitable parallelization targets in the source code. Parallelism discovery tools help developers to find such targets automatically. Unfortunately, tools that identify parallelism during compilation are usually conservative due to the lack of runtime information, and tools relying on runtime information primarily suffer from high overhead in terms of both time and memory. This dissertation presents a generic framework for parallelism discovery based on dynamic program analysis, supporting various types of parallelism while incurring practically affordable overhead. The framework contains two main components: an efficient data-dependence profiler and a set of parallelism discovery algorithms based on a language-independent concept called Computational Unit.
The data-dependence profiler serves as the foundation of the parallelism discovery framework. Traditional dependence profiling approaches introduce a tremendous amount of time and memory overhead. To lower the overhead, current methods limit their scope to the subset of the dependence information needed for the analysis they have been created for, sacrificing generality and discouraging reuse. In contrast, the profiler shown in this thesis addresses the problem via signature-based memory management and a lock-free parallel design. It produces detailed dependences not only for sequential but also for multi-threaded code without causing prohibitive overhead, allowing it to serve as a generic base for various program analysis techniques.
Computational Units (CUs) provide a language-independent foundation for parallelism discovery. CUs are computations that follow the read-compute-write pattern. Unlike other concepts, they are not restricted to predefined language constructs. A program is represented as a CU graph, in which vertexes are CUs and edges are data dependences. This allows parallelism to be detected that spreads across multiple language constructs, taking code refactoring into consideration. The parallelism discovery algorithms cover both loop and task parallelism.
Results of our experiments show that 1) the efficient data-dependence profiler has a very competitive average slowdown of around 80× with accuracy higher than 99.6%; 2) the framework discovers parallelism with high accuracy, identifying 92.5% of the parallel loops in NAS benchmarks; 3) when parallelizing well-known open-source software following the outputs of the framework, reasonable speedups are obtained. Finally, use cases beyond parallelism discovery are briefly demonstrated to show the generality of the framework
Recommended from our members
Loop unrolling for parallel systems
In this paper we study the problem of scheduling parallel program tasks that are enclosed in a set of nested loops on parallel computers. We first introduce a representation of the tasks and their relations in a loop. The representation allows us to express loop-carried data dependences among tasks as well as loop information that cannot be represented using ordinary task graphs. We also introduce a new technique for scheduling unrolled loops onto arbitrary target machines. Our technique allows several iterations of a set of loops as well as tasks within the same iteration to overlap in execution in a way that minimizes the loop completion time. We use local neighborhood search and simulated annealing optimization methods to find the best way to unroll a set of nested loops. Our goal is to find which loops to unroll and for how many times, and the Gantt chart that indicates the allocation and the order of the tasks in the post-unrolling loop on the available processors
Recommended from our members
Guided Automatic Binary Parallelisation
For decades, the software industry has amassed a vast repository of pre-compiled libraries and executables which are still valuable and actively in use. However, for a significant fraction of these binaries, most of the source code is absent or is written in old languages, making it practically impossible to recompile them for new generations of hardware. As the number of cores in chip multi-processors (CMPs) continue to scale, the performance of this legacy software becomes increasingly sub-optimal. Rewriting new optimised and parallel software would be a time-consuming and expensive task. Without source code, existing automatic performance enhancing and parallelisation techniques are not applicable for legacy software or parts of new applications linked with legacy libraries.
In this dissertation, three tools are presented to address the challenge of optimising legacy binaries. The first, GBR (Guided Binary Recompilation), is a tool that recompiles stripped application binaries without the need for the source code or relocation information. GBR performs static binary analysis to determine how recompilation should be undertaken, and produces a domain-specific hint program. This hint program is loaded and interpreted by the GBR dynamic runtime, which is built on top of the open-source dynamic binary translator, DynamoRIO. In this manner, complicated recompilation of the target binary is carried out to achieve optimised execution on a real system. The problem of limited dataflow and type information is addressed through cooperation between the hint program and JIT optimisation. The utility of GBR is demonstrated by software prefetch and vectorisation optimisations to achieve performance improvements compared to their original native execution.
The second tool is called BEEP (Binary Emulator for Estimating Parallelism), an extension to GBR for binary instrumentation.
BEEP is used to identify potential thread-level parallelism through static binary analysis and binary instrumentation.
BEEP performs preliminary static analysis on binaries and encodes all statically-undecided questions into a hint program.
The hint program is interpreted by GBR so that on-demand binary instrumentation codes are inserted to answer the questions from runtime information.
BEEP incorporates a few parallel cost models to evaluate identified parallelism under different parallelisation paradigms.
The third tool is named GABP (Guided Automatic Binary Parallelisation), an extension to GBR for parallelisation. GABP focuses on loops from sequential application binaries and automatically extracts thread-level parallelism from them on-the-fly, under the direction of the hint program, for efficient parallel execution. It employs a range of runtime schemes, such as thread-level speculation and synchronisation, to handle runtime data dependences. GABP achieves a geometric mean of speedup of 1.91x on binaries from SPEC CPU2006 on a real x86-64 eight-core system compared to native sequential execution. Performance is obtained for SPEC CPU2006 executables compiled from a variety of source languages and by different compilers.St John's Benefactor Scholarship
ARM Sponsorshi