19 research outputs found

    Porting the Sisal functional language to distributed-memory multiprocessors

    Parallel computing has become increasingly ubiquitous in recent years, and the sizes of the problems used to model real-world applications continue to grow. Distributed-memory multiprocessors have been regarded as a scalable and economical architecture for building large-scale parallel machines. While these machines provide ample computational capability, programming them is often very difficult due to many practical issues, including parallelization, data distribution, workload distribution, and remote memory latency. This thesis proposes to address the programmability and performance issues of distributed-memory machines using the Sisal functional language. Programs written in Sisal are automatically parallelized, scheduled, and run on distributed-memory multiprocessors with no programmer intervention. Specifically, the proposed approach consists of the following steps. Given a program written in Sisal, the front-end Sisal compiler generates a directed acyclic graph (DAG) to expose the parallelism in the program. The DAG is partitioned and scheduled based on loop parallelism. The scheduled DAG is then translated to C programs with machine-specific parallel constructs. The parallel C programs are finally compiled by the target machine's compilers to generate executables. A distributed-memory parallel machine, the 80-processor ETL EM-X, was chosen for the experiments, and the entire procedure has been implemented on the EM-X multiprocessor. Four problems were selected for the experiments: bitonic sorting, search, dot product, and Fast Fourier Transform. Preliminary execution results indicate that automatic parallelization of Sisal programs based on loop parallelism is effective: the speedup for these four problems ranges from 17 to 60 on a 64-processor EM-X. Preliminary experimental results further indicate that programming distributed-memory multiprocessors in a functional language indeed frees programmers from low-level programming details while allowing them to focus on algorithmic performance improvement.
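
    As a hedged illustration of the back end's role, the C-style sketch below shows roughly what a loop-parallel dot product (one of the four benchmarks) might look like after translation to threaded C. It uses POSIX threads rather than the EM-X-specific parallel constructs, and every name in it is invented for illustration.

        /* Hypothetical sketch of the kind of parallel C a Sisal back end might
         * emit for a loop-parallel dot product. POSIX threads stand in for the
         * EM-X-specific constructs; all names are invented. */
        #include <pthread.h>
        #include <stdio.h>

        #define N 1024
        #define NPROC 4

        static double a[N], b[N];
        static double partial[NPROC];

        /* Each worker reduces one contiguous slice of the iteration space. */
        static void *dot_slice(void *arg) {
            long p = (long)arg;
            long lo = p * (N / NPROC), hi = lo + (N / NPROC);
            double s = 0.0;
            for (long i = lo; i < hi; i++)
                s += a[i] * b[i];
            partial[p] = s;
            return NULL;
        }

        int main(void) {
            for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

            pthread_t tid[NPROC];
            for (long p = 0; p < NPROC; p++)
                pthread_create(&tid[p], NULL, dot_slice, (void *)p);

            double dot = 0.0;
            for (long p = 0; p < NPROC; p++) {
                pthread_join(tid[p], NULL);
                dot += partial[p];   /* final reduction of per-thread sums */
            }
            printf("dot = %f\n", dot);
            return 0;
        }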

    A graphical system for parallel software development

    PhD Thesis. Parallel architectures have become popular in the race to meet an increasing demand for computational power. While the benefits of parallel computing, the performance improvements due to simultaneous computation, are clear, achieving these benefits has proved difficult. The wide variety of parallel architectures has led to a similarly diverse range of parallel languages and methods for parallel programming, many of which feature complicated architecture-specific language mechanisms. The lack of good tools to assist in the development of parallel software has compounded the problem of parallel programming being limited to a field which is both specialist and fragmented. This thesis investigates techniques for the graphical specification of parallel programs, using an architecture-independent graph-based notation representing the design of the program, combined with conventional sequential languages. Automatic code generation is used to translate the graph program into executable code suitable for different parallel architectures. To address the differing performance characteristics of parallel architectures, methods for the graphical adjustment of granularity are proposed and investigated, and an encompassing parallel design environment is presented. Funded by the Engineering and Physical Sciences Research Council.
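
    As a rough, hypothetical sketch of the underlying idea (not the notation developed in the thesis), the fragment below shows a task graph whose nodes wrap conventional sequential code and whose edges record dependencies, with code generation reduced here to a trivial sequential back end.

        // Hypothetical sketch of a graph-based program specification: nodes wrap
        // ordinary sequential code, edges record dependencies, and a separate
        // back end decides how the graph is run. Names are invented.
        #include <functional>
        #include <iostream>
        #include <string>
        #include <vector>

        struct Node {
            std::string name;
            std::function<void()> body;      // conventional sequential code
        };

        struct Edge { int from, to; };       // dependency between two nodes

        struct Graph {
            std::vector<Node> nodes;
            std::vector<Edge> edges;
        };

        // Trivial "back end": runs nodes in index order. A real generator could
        // instead emit threaded or message-passing code for each target machine.
        void run_sequential(const Graph& g) {
            for (const auto& n : g.nodes) {
                std::cout << "running " << n.name << "\n";
                n.body();
            }
        }

        int main() {
            Graph g;
            g.nodes = { {"produce", []{ /* sequential work */ }},
                        {"consume", []{ /* sequential work */ }} };
            g.edges = { {0, 1} };
            run_sequential(g);
        }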

    The JCilk multithreaded language

    Thesis (S.M.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 103-107). JCilk is a Java-based multithreaded programming language which extends Java to provide a dynamic threading model. Specifically, JCilk imports Cilk's fork-join primitives spawn and sync into Java to provide procedure-call semantics for concurrent subcomputations. More importantly, JCilk integrates exception handling with multithreading by defining semantics consistent with Java's existing exception-handling semantics. JCilk's strategy of integrating multithreading with Java's exception semantics yields some surprising semantic synergies. In particular, JCilk extends Java's exception semantics to allow exceptions to be passed from a spawned method to its parent in a natural way that obviates the need for Cilk's inlet and abort constructs. This extension is "faithful" in that it obeys Java's ordinary serial semantics when executed on a single processor. When executed in parallel, however, an exception thrown by a JCilk computation signals its sibling computations to abort, yielding a clean semantics in which only a single exception from the enclosing try block is handled. To minimize the complexity of reasoning about aborts, JCilk signals them "semisynchronously" so that abort signals do not interrupt ordinary serial code. Because JCilk uses Java's normal exception mechanism to propagate an abort throughout a subcomputation, the programmer can handle clean-up by simply catching a thrown CilkAbort exception. This thesis documents in detail the designed semantics, the linguistic decisions we made, and their justifications. It also describes the structure of the JCilk compiler and how it supports the exception semantics. Specifically, the JCilk compiler performs a two-stage compilation process to support the continuation mechanism required by the runtime system's work-stealing algorithm, and by performing static analysis it generates code to support the "catchlet" and "finallet" mechanisms for handling exceptions. The design of JCilk represents joint research with John S. Danaher and Charles E. Leiserson. By I-Ting Angelina Lee, S.M.
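
    JCilk's spawn/sync syntax is not reproduced in the abstract, so the sketch below uses plain C++ futures only as an analogy for the core idea: an exception thrown in a spawned child surfaces in the parent at the join point. Real JCilk goes further by signalling sibling computations to abort via a CilkAbort exception, which ordinary futures do not do.

        // Analogy only: an exception raised in a spawned task reaches the parent
        // when the parent joins (here, via future::get). JCilk additionally
        // signals sibling computations to abort; plain futures have no such
        // mechanism. This is not JCilk syntax.
        #include <future>
        #include <iostream>
        #include <stdexcept>

        int risky_child(int x) {
            if (x < 0) throw std::runtime_error("negative input");
            return x * 2;
        }

        int main() {
            auto child = std::async(std::launch::async, risky_child, -1); // "spawn"
            try {
                int r = child.get();                                      // "sync"
                std::cout << "result: " << r << "\n";
            } catch (const std::exception& e) {
                // The child's exception propagates to the parent at the join point.
                std::cout << "caught from child: " << e.what() << "\n";
            }
        }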

    Automatic Code-Generation Techniques for Micro-Threaded RISC Architectures

    Submitted to the University of Hertfordshire in partial fulfillment of the requirements of the degree of Master of Science by Research. There has been an ever-widening gap between processor and memory speeds, resulting in a 'memory wall' where the time for memory accesses dominates performance. To counter this, architectures that use many very small threads, allowing multiple memory accesses to occur in parallel, have been under investigation. Examples of these architectures are the CARE (Compiler Aided Reorder Engine) architecture, micro-threading architectures, and cellular architectures such as the IBM Cyclops family, implemented using processors-in-memory (PIM), which is the main architecture discussed in this thesis. PIM architectures achieve high performance by increasing the bandwidth of processor-to-memory communication and reducing its latency through the use of many processors placed physically close to the main memory. These massively parallel architectures may have sophisticated memory models, and I contend that there is an open question regarding the ideal approach, from the programmer's perspective, to implementing parallelism via many threads. Should the implementation be at the language level, such as UPC, HPF, or other language extensions, or within the compiler using trace scheduling? Should it be at the library level, for example OpenMP or POSIX threads? Or should it be within the architecture, such as designs derived from dataflow architectures? In this thesis, DIMES (the Delaware Iterative Multiprocessor Emulation System), which is being developed by CAPSL at the University of Delaware, was used as a hardware evaluation tool for such cellular architectures. As the programming example, the author chose a threaded Mandelbrot-set generator with a work-stealing algorithm to evaluate the DIMES cthread programming model. This implementation was used to identify potential problems and issues that may occur when attempting to implement a massive number of very short-lived threads.
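
    As a simplified, hypothetical sketch of the evaluation program's shape (not the DIMES cthread API), the fragment below hands out Mandelbrot rows dynamically through an atomic counter; a genuine work-stealing scheduler would instead use per-thread deques with idle threads stealing from busy ones.

        // Simplified stand-in for the threaded Mandelbrot evaluation: many small
        // units of work distributed dynamically via an atomic row counter. Not
        // the DIMES cthread interface, and not true work stealing.
        #include <atomic>
        #include <complex>
        #include <cstdio>
        #include <thread>
        #include <vector>

        constexpr int W = 256, H = 256, MAX_ITER = 100;
        std::atomic<int> next_row{0};
        std::vector<int> image(W * H);

        int escape_time(std::complex<double> c) {
            std::complex<double> z = 0;
            for (int i = 0; i < MAX_ITER; ++i) {
                z = z * z + c;
                if (std::abs(z) > 2.0) return i;
            }
            return MAX_ITER;
        }

        void worker() {
            for (int y; (y = next_row.fetch_add(1)) < H; ) {   // grab the next row
                for (int x = 0; x < W; ++x) {
                    std::complex<double> c(-2.0 + 3.0 * x / W, -1.5 + 3.0 * y / H);
                    image[y * W + x] = escape_time(c);
                }
            }
        }

        int main() {
            unsigned n = std::thread::hardware_concurrency();
            if (n == 0) n = 4;   // fall back when the runtime cannot report a count
            std::vector<std::thread> pool;
            for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
            for (auto& t : pool) t.join();
            std::printf("center pixel: %d\n", image[(H / 2) * W + W / 2]);
        }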

    Supporting high-level, high-performance parallel programming with library-driven optimization

    Parallel programming is a demanding task for developers partly because achieving scalable parallel speedup requires drawing upon a repertoire of complex, algorithm-specific, architecture-aware programming techniques. Ideally, developers of programming tools would be able to build algorithm-specific, high-level programming interfaces that hide the complex architecture-aware details. However, it is a monumental undertaking to develop such tools from scratch, and it is challenging to provide reusable functionality for developing such tools without sacrificing the hosted interface’s performance or ease of use. In particular, to get high performance on a cluster of multicore computers without requiring developers to manually place data and computation onto processors, it is necessary to combine prior methods for shared memory parallelism with new methods for algorithm-aware distribution of computation and data across the cluster. This dissertation presents Triolet, a programming language and compiler for high-level programming of parallel loops for high-performance execution on clusters of multicore computers. Triolet adopts a simple, familiar programming interface based on traversing collections of data. By incorporating semantic knowledge of how traversals behave, Triolet achieves efficient parallel execution and communication. Moreover, Triolet’s performance on sequential loops is comparable to that of low-level C code, ranging from seven percent slower to 2.8× slower on tested benchmarks. Triolet’s design demonstrates that it is possible to decouple the design of a compiler from the implementation of parallelism without sacrificing performance or ease of use: parallel and sequential loops are implemented as library code and compiled to efficient code by an optimizing compiler that is unaware of parallelism beyond the scope of a single thread. All handling of parallel work partitioning, data partitioning, and scheduling is embodied in library code. During compilation, library code is inlined into a program and specialized to yield customized parallel loops. Experimental results from a 128-core cluster (with 8 nodes and 16 cores per node) show that loops in Triolet outperform loops in Eden, a similar high-level language. Triolet achieves significant parallel speedup over sequential C code, with performance ranging from slightly faster to 4.3× slower than manually parallelized C code on compute-intensive loops. Thus, Triolet demonstrates that a library of container traversal functions can deliver cluster-parallel performance comparable to manually parallelized C code without requiring programmers to manage parallelism. This programming approach opens the potential for future research into parallel programming frameworks
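
    Triolet's own syntax is not shown in the abstract; the hypothetical C++ sketch below only illustrates the general shape of library-driven traversal, where user code maps a function over a collection and all partitioning and threading live inside a library routine.

        // Hypothetical illustration of library-driven parallel traversal: user
        // code just maps a function over a collection; partitioning and threading
        // live entirely inside the library routine. This is not Triolet syntax,
        // and the names are invented.
        #include <cstdio>
        #include <thread>
        #include <vector>

        template <typename T, typename F>
        std::vector<T> parallel_map(const std::vector<T>& in, F f) {
            std::vector<T> out(in.size());
            unsigned nworkers = std::thread::hardware_concurrency();
            if (nworkers == 0) nworkers = 4;   // fallback worker count
            std::vector<std::thread> pool;
            for (unsigned w = 0; w < nworkers; ++w) {
                pool.emplace_back([&, w] {
                    // Each worker handles a strided share of the index space.
                    for (std::size_t i = w; i < in.size(); i += nworkers)
                        out[i] = f(in[i]);
                });
            }
            for (auto& t : pool) t.join();
            return out;
        }

        int main() {
            std::vector<double> xs(1000, 3.0);
            auto ys = parallel_map(xs, [](double x) { return x * x; });
            std::printf("ys[0] = %f\n", ys[0]);
        }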

    An Efficient Execution Model for Reactive Stream Programs

    Stream programming is a paradigm where a program is structured as a set of computational nodes connected by streams. Focusing on data moving between computational nodes via streams, this programming model fits well for applications that process long sequences of data. We call such applications reactive stream programs (RSPs) to distinguish them from stream programs with rather small and finite input data. In stream programming, concurrency is expressed implicitly via communication streams. This helps to reduce the complexity of parallel programming. For this reason, stream programming has gained popularity as a programming model for parallel platforms. However, it is also challenging to analyse and improve performance without an understanding of the program's internal behaviour. This thesis targets an efficient execution model for deploying RSPs on parallel platforms. This execution model includes a monitoring framework to understand the internal behaviour of RSPs, scheduling strategies for RSPs on uniform shared-memory platforms, and mapping techniques for deploying RSPs on heterogeneous distributed platforms. The foundation of the execution model is a study of the performance of RSPs in terms of throughput and latency. This study includes quantitative formulae for throughput and latency, as well as the identification of factors that influence these performance metrics. Based on the study of RSP performance, this thesis exploits characteristics of RSPs to derive effective scheduling strategies on uniform shared-memory platforms. Aiming to optimise both throughput and latency, these scheduling strategies are implemented in two heuristic-based schedulers. Both are designed to be centralised, providing load balancing for RSPs with dynamic behaviour as well as dynamic structures. The first uses the notion of positive and negative data demands on each stream to determine the scheduling priorities; this scheduler is independent of the runtime system. The second requires the runtime system to provide position information for each computational node in the RSP and uses that information to decide the scheduling priorities. Our experiments show that both schedulers provide similar performance while being significantly better than a reference implementation without dynamic load balancing. Also based on the study of RSP performance, we present in this thesis two new heuristic partitioning algorithms used to map RSPs onto heterogeneous distributed platforms: Kernighan-Lin Adaptation (KLA) and Congestion Avoidance (CA), where the main objective is to optimise the throughput. This is a multi-parameter optimisation problem where existing graph partitioning algorithms are not applicable. Compared to the generic meta-heuristic Simulated Annealing algorithm, both proposed algorithms achieve equally good or better results. KLA is faster for small benchmarks while slower for large ones. In contrast, CA is always orders of magnitude faster, even for very large benchmarks.
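
    The abstract does not reproduce the thesis's throughput and latency formulae. For orientation only, a standard steady-state bound for a simple pipeline of computational nodes, with t_i the per-item service time of node i, is:

        \text{throughput} \le \min_i \frac{1}{t_i}, \qquad
        \text{latency per item} \ge \sum_{i \in \text{path}} t_i

    The thesis's own formulae, which model general RSP structures and their influencing factors, may of course differ from this simple bound.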

    Polyhedral+Dataflow Graphs

    This research presents an intermediate compiler representation that is designed for optimization and that emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.
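
    As a hypothetical illustration of the kind of information such a graph might record (not the actual C++ specification language of this work), the sketch below attaches read and write data mappings and a schedule position to each computational statement.

        // Hypothetical sketch of a polyhedral+dataflow IR node: each statement
        // records the data it reads and writes and a schedule position. This is
        // illustrative only and does not reproduce the DSL described in the work.
        #include <iostream>
        #include <string>
        #include <vector>

        struct DataMapping {
            std::string space;                 // e.g. an access pattern like "A[i][j]"
        };

        struct Statement {
            std::string name;                  // e.g. "S0: y[i] += A[i][j] * x[j]"
            std::vector<DataMapping> reads;
            std::vector<DataMapping> writes;
            std::vector<int> schedule;         // lexicographic schedule position
        };

        struct DataflowGraph {
            std::vector<Statement> statements; // edges implied by read/write overlap
        };

        int main() {
            DataflowGraph g;
            g.statements.push_back({"S0: y[i] += A[i][j] * x[j]",
                                    {{"A[i][j]"}, {"x[j]"}, {"y[i]"}},
                                    {{"y[i]"}},
                                    {0, 0}});
            std::cout << g.statements[0].name << " has "
                      << g.statements[0].reads.size() << " reads\n";
        }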

    Silkroad: A system supporting DSM and multiple paradigms in cluster computing

    Ph.D. (Doctor of Philosophy)