3,324 research outputs found
Recommended from our members
Percolation-based compiling for evaluation of parallelism and hardware design trade-offs
This thesis investigates parallelism and hardware design trade-offs of parallel and pipelined architectures. To explore these trade-offs we developed a retargetable compiler based on a set of powerful code transformations called Percolation Scheduling (PS) that map programs with real-time constraints and/or massive time requirements onto synchronous, parallel, high-performance or semi-custom architectures.High-performance is achieved through extraction of application inherent fine-grain parallelism and the use of a suitable architecture. Exploiting fine-grain parallelism is a critical part of exploiting all of the parallelism available in a given program, particularly since highly irregular forms of parallelism are often not visible at coarser levels and since the use of low-level parallelism has a multiplicative effect on the overall performance.To extract substantial parallelism from both the hardware and the compiler, we use a clean, highly parallel VLIW-like architecture that is synchronous, has multiple functional units and has a single program counter. The use of a hazard-free and homogeneous architecture does not result only in a better VLSI design but also considerably increases the compiler's ability to produce better code. To further enhance parallelism we modified the uni-cycle VLIW model and extended the transformations such that pipelined units that provide extra parallelism are used.Another approach presented is of resource constrained scheduling (RCS). Since the RCS problem is known to be NP-hard, in practice it may be solved only by a heuristic approach. We argue that using the heuristic after extraction of the unlimited-resources schedule may yield better results than if the heuristic has been applied at the beginning of the scheduling process.Through a series of benchmarks we evaluate hardware design trade-offs and show that speed-ups on average of one order of magnitude are feasible with sufficient functional units. However, when resources are limited we show that the number of functional units needed may be optimized for a particular suite of application programs
On-the-fly tracing for data-centric computing : parallelization, workflow and applications
As data-centric computing becomes the trend in science and engineering, more and more hardware systems, as well as middleware frameworks, are emerging to handle the intensive computations associated with big data. At the programming level, it is crucial to have corresponding programming paradigms for dealing with big data. Although MapReduce is now a known programming model for data-centric computing where parallelization is completely replaced by partitioning the computing task through data, not all programs particularly those using statistical computing and data mining algorithms with interdependence can be re-factorized in such a fashion. On the other hand, many traditional automatic parallelization methods put an emphasis on formalism and may not achieve optimal performance with the given limited computing resources. In this work we propose a cross-platform programming paradigm, called on-the-fly data tracing , to provide source-to-source transformation where the same framework also provides the functionality of workflow optimization on larger applications. Using a big-data approximation computations related to large-scale data input are identified in the code and workflow and a simplified core dependence graph is built based on the computational load taking in to account big data. The code can then be partitioned into sections for efficient parallelization; and at the workflow level, optimization can be performed by adjusting the scheduling for big-data considerations, including the I/O performance of the machine. Regarding each unit in both source code and workflow as a model, this framework enables model-based parallel programming that matches the available computing resources. The techniques used in model-based parallel programming as well as the design of the software framework for both parallelization and workflow optimization as well as its implementations with multiple programming languages are presented in the dissertation. Then, the following experiments are performed to validate the framework: i) the benchmarking of parallelization speed-up using typical examples in data analysis and machine learning (e.g. naive Bayes, k-means) and ii) three real-world applications in data-centric computing with the framework are also described to illustrate the efficiency: pattern detection from hurricane and storm surge simulations, road traffic flow prediction and text mining from social media data. In the applications, it illustrates how to build scalable workflows with the framework along with performance enhancements
The hArtes Tool Chain
This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform
Getting to the Point. Index Sets and Parallelism-Preserving Autodiff for Pointful Array Programming
We present a novel programming language design that attempts to combine the
clarity and safety of high-level functional languages with the efficiency and
parallelism of low-level numerical languages. We treat arrays as
eagerly-memoized functions on typed index sets, allowing abstract function
manipulations, such as currying, to work on arrays. In contrast to composing
primitive bulk-array operations, we argue for an explicit nested indexing style
that mirrors application of functions to arguments. We also introduce a
fine-grained typed effects system which affords concise and
automatically-parallelized in-place updates. Specifically, an associative
accumulation effect allows reverse-mode automatic differentiation of in-place
updates in a way that preserves parallelism. Empirically, we benchmark against
the Futhark array programming language, and demonstrate that aggressive
inlining and type-driven compilation allows array programs to be written in an
expressive, "pointful" style with little performance penalty.Comment: 31 pages with appendix, 11 figures. A conference submission is still
under revie
Recommended from our members
A Survey of Parallel Programming Constructs
This paper surveys the types of parallelism found in Functional, Lisp and Object-Oriented languages. In particular, it concentrates on the addition of high level parallel constructs to these types of languages. The traditional area of the automatic extraction of parallelism by a compiler [39] is ignored here in favor of the addition of new constructs, because the long history of such automatic techniques has shown that they are not sufficient to allow the massive parallelism promised from modem computer architectures [26. 58]. The problem then, simply stated, is given that it is now possible for us to build massively parallel machines and given that our current compilers seem incapable of generating sufficient parallelism automatically, what should the language designer do? A reasonable answer seems to be to add constructs to languages that allow the expression of additional parallelism in a natural way. Indeed that is what the designers of the the Functional, Lisp, and Object-Oriented languages described below have attempted to do. The three particular programming formalisms were picked because most of the initial ideas seem to have been generated by the designers of the functional languages and most of the current activity seems to be in the Lisp and Objected-Oriented domains. There is also a great deal of activity in the Logic programming area, but this activity is more in the area of executing the existing constructs in parallel as opposed to adding constructs specifically designed to increase parallelism
A Theoretical Approach Involving Recurrence Resolution, Dependence Cycle Statement Ordering and Subroutine Transformation for the Exploitation of Parallelism in Sequential Code.
To exploit parallelism in Fortran code, this dissertation consists of a study of the following three issues: (1) recurrence resolution in Do-loops for vector processing, (2) dependence cycle statement ordering in Do-loops for parallel processing, and (3) sub-routine parallelization. For recurrence resolution, the major findings include: (1) the node splitting algorithm cannot be used directly to break an essential antidependence link, of which the source variable that results in antidependence is itself the sink variable of another true dependence so a correction method is proposed, (2) a sink variable renaming technique is capable of breaking an antidependence and/or output-dependence link, (3) for recurrences formed by only true dependences, a dynamic dependence concept and the derived technique are powerful, and (4) by integrating related techniques, an algorithm for resolving a general multistatement recurrence is developed. The performance of a parallel loop is determined by the level of parallelism and the time delay due to interprocessor communication and synchronization. For a dependence cycle of a single parallel loop executed in a general synchronization mode, the parallelism exposed varies with the alignment of statements. Statements are reordered on the basis of execution-time of the loop as estimated at compile-time. An improved timing formula and a derived statement ordering algorithm are proposed. Further extension of this algorithm to multiple perfectly nested Do-loops with simple global dependence cycle is also presented. The subroutine is a potential source for parallel processing. Several problems must be solved for subroutine parallelization: (1) the precedence of parallel executions of subroutines, (2) identification of the optimum execution mode for each subroutine and (3) the restructuring of a serial program. A five-step approach to parallelize called subroutines for a calling subroutine is proposed: (1) computation of control dependence, (2) approximation of the global effects of subroutines, (3) analysis of data dependence, (4) identification of execution mode, and (5) restructuring of calling and called subroutines. Application of these five steps in a recursive manner to different levels of calling subroutines in a program addresses the parallelization of subroutines
- …