39 research outputs found

    Data-parallel concurrent constraint programming.

    by Bo-ming Tong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1994. Includes bibliographical references (leaves 104-[110]).

    Chapter 1 --- Introduction --- p.1
    Chapter 1.1 --- Concurrent Constraint Programming --- p.2
    Chapter 1.2 --- Finite Domain Constraints --- p.3
    Chapter 2 --- The Firebird Language --- p.5
    Chapter 2.1 --- Finite Domain Constraints --- p.6
    Chapter 2.2 --- The Firebird Computation Model --- p.6
    Chapter 2.3 --- Miscellaneous Features --- p.7
    Chapter 2.4 --- Clause-Based Nondeterminism --- p.9
    Chapter 2.5 --- Programming Examples --- p.10
    Chapter 2.5.1 --- Magic Series --- p.10
    Chapter 2.5.2 --- Weak Queens --- p.14
    Chapter 3 --- Operational Semantics --- p.15
    Chapter 3.1 --- The Firebird Computation Model --- p.16
    Chapter 3.2 --- The Firebird Commit Law --- p.17
    Chapter 3.3 --- Derivation --- p.17
    Chapter 3.4 --- Correctness of Firebird Computation Model --- p.18
    Chapter 4 --- Exploitation of Data-Parallelism in Firebird --- p.24
    Chapter 4.1 --- An Illustrative Example --- p.25
    Chapter 4.2 --- Mapping Partitions to Processor Elements --- p.26
    Chapter 4.3 --- Masks --- p.27
    Chapter 4.4 --- Control Strategy --- p.27
    Chapter 4.4.1 --- A Control Strategy Suitable for Linear Equations --- p.28
    Chapter 5 --- Data-Parallel Abstract Machine --- p.30
    Chapter 5.1 --- Basic DPAM --- p.31
    Chapter 5.1.1 --- Hardware Requirements --- p.31
    Chapter 5.1.2 --- Procedure Calling Convention and Process Creation --- p.32
    Chapter 5.1.3 --- Memory Model --- p.34
    Chapter 5.1.4 --- Registers --- p.41
    Chapter 5.1.5 --- Process Management --- p.41
    Chapter 5.1.6 --- Unification --- p.49
    Chapter 5.1.7 --- Variable Table --- p.49
    Chapter 5.2 --- DPAM with Backtracking --- p.50
    Chapter 5.2.1 --- Choice Point --- p.52
    Chapter 5.2.2 --- Trailing --- p.52
    Chapter 5.2.3 --- Recovering the Process Queues --- p.57
    Chapter 6 --- Implementation --- p.58
    Chapter 6.1 --- The DECmpp Massively Parallel Computer --- p.58
    Chapter 6.2 --- Implementation Overview --- p.59
    Chapter 6.3 --- Constraints --- p.60
    Chapter 6.3.1 --- Breaking Down Equality Constraints --- p.61
    Chapter 6.3.2 --- Processing the Constraint 'As Is' --- p.62
    Chapter 6.4 --- The Wide-Tag Architecture --- p.63
    Chapter 6.5 --- Register Window --- p.64
    Chapter 6.6 --- Dereferencing --- p.65
    Chapter 6.7 --- Output --- p.66
    Chapter 6.7.1 --- Collecting the Solutions --- p.66
    Chapter 6.7.2 --- Decoding the Solution --- p.68
    Chapter 7 --- Performance --- p.69
    Chapter 7.1 --- Uniprocessor Performance --- p.71
    Chapter 7.2 --- Solitary Mode --- p.73
    Chapter 7.3 --- Bit Vectors of Domain Variables --- p.75
    Chapter 7.4 --- Heap Consumption of the Heap Frame Scheme --- p.77
    Chapter 7.5 --- Eager Nondeterministic Derivation vs Lazy Nondeterministic Derivation --- p.78
    Chapter 7.6 --- Priority Scheduling --- p.79
    Chapter 7.7 --- Execution Profile --- p.80
    Chapter 7.8 --- Effect of the Number of Processor Elements on Performance --- p.82
    Chapter 7.9 --- Change of the Degree of Parallelism During Execution --- p.84
    Chapter 8 --- Related Work --- p.88
    Chapter 8.1 --- Vectorization of Prolog --- p.89
    Chapter 8.2 --- Parallel Clause Matching --- p.90
    Chapter 8.3 --- Parallel Interpreter --- p.90
    Chapter 8.4 --- Bounded Quantifications --- p.91
    Chapter 8.5 --- SIMD MultiLog --- p.91
    Chapter 9 --- Conclusion --- p.93
    Chapter 9.1 --- Limitations --- p.94
    Chapter 9.1.1 --- Data-Parallel Firebird is Specialized --- p.94
    Chapter 9.1.2 --- Limitations of the Implementation Scheme --- p.95
    Chapter 9.2 --- Future Work --- p.95
    Chapter 9.2.1 --- Extending Firebird --- p.95
    Chapter 9.2.2 --- Improvements Specific to DECmpp --- p.99
    Chapter 9.2.3 --- Labeling --- p.100
    Chapter 9.2.4 --- Parallel Domain Consistency --- p.101
    Chapter 9.2.5 --- Branch and Bound Algorithm --- p.102
    Chapter 9.2.6 --- Other Possible Future Work --- p.102
    Bibliography --- p.104
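    The magic series benchmark named in Chapter 2.5.1 is a standard finite-domain constraint problem: a series s of length n is magic when each element s_i equals the number of occurrences of the value i in s. As an illustration only (Firebird is a concurrent constraint language, so this is not Firebird code), a naive generate-and-test version in Haskell might look like the sketch below; a finite-domain solver such as Firebird would instead prune variable domains by constraint propagation.

        -- Hypothetical sketch of the magic series problem, not Firebird code.
        -- A series s is magic iff s !! i == number of occurrences of i in s.
        isMagic :: [Int] -> Bool
        isMagic s = and [ count i == s !! i | i <- [0 .. n - 1] ]
          where
            n       = length s
            count x = length (filter (== x) s)

        -- Naive generate-and-test search over all candidate series of length n;
        -- a constraint solver prunes this space by propagation instead.
        magicSeries :: Int -> [[Int]]
        magicSeries n = filter isMagic (candidates n)
          where
            candidates 0 = [[]]
            candidates k = [ d : rest | d <- [0 .. n], rest <- candidates (k - 1) ]

    For example, magicSeries 4 finds [1,2,1,0]: the series contains one 0, two 1s, one 2 and no 3s.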

    Reliable scalable symbolic computation: The design of SymGridPar2

    Symbolic computation is an important area of both Mathematics and Computer Science, with many large computations that would benefit from parallel execution. Symbolic computations are, however, challenging to parallelise as they have complex data and control structures, and both dynamic and highly irregular parallelism. The SymGridPar framework (SGP) has been developed to address these challenges on small-scale parallel architectures. However, the multicore revolution means that the number of cores, and with it the number of failures, is growing exponentially, and that the communication topology is becoming increasingly complex. Hence an improved parallel symbolic computation framework is required. This paper presents the design and initial evaluation of SymGridPar2 (SGP2), a successor to SymGridPar that is designed to scale to 10^5 cores and, since failures are commonplace at that scale, to provide fault tolerance. We present the SGP2 design goals, principles and architecture. We describe how scalability is achieved using layering and by allowing the programmer to control task placement. We outline how fault tolerance is provided by supervising remote computations, and sketch higher-level fault tolerance abstractions. We describe the SGP2 implementation status and development plans. We report the scalability and efficiency, including weak scaling to about 32,000 cores, and investigate the overheads of tolerating faults for simple symbolic computations.
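    A minimal Haskell sketch of the supervision pattern described above. The retry bound and the remote action are hypothetical stand-ins, not the SGP2 API; the point is only that a supervisor observes a remote computation and reschedules it when the hosting node fails.

        {-# LANGUAGE ScopedTypeVariables #-}
        import Control.Exception (SomeException, try)

        -- Supervise a remote computation: on failure (e.g. the hosting node
        -- died), resubmit it, up to a bounded number of retries.
        -- 'remote' is a placeholder for an SGP2-style remote task.
        supervise :: Int -> IO a -> IO a
        supervise retries remote = do
          result <- try remote
          case result of
            Right value -> return value
            Left (e :: SomeException)
              | retries > 0 -> supervise (retries - 1) remote
              | otherwise   -> ioError (userError ("task failed: " ++ show e))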

    GUMSMP: a scalable parallel Haskell implementation

    The most widely available high-performance platforms today are hierarchical, with shared-memory leaves, e.g. clusters of multi-cores, or NUMA machines with multiple regions. The Glasgow Haskell Compiler (GHC) provides a number of parallel Haskell implementations targeting different parallel architectures. In particular, GHC-SMP supports shared-memory architectures, and GHC-GUM supports distributed-memory machines. Both implementations use different, but related, runtime system (RTS) mechanisms and achieve good performance. A specialised RTS for the ubiquitous hierarchical architectures is lacking. This thesis presents the design, implementation, and evaluation of a new parallel Haskell RTS, GUMSMP, that combines shared- and distributed-memory mechanisms to exploit hierarchical architectures more effectively. The design evaluates a variety of choices and aims to efficiently combine scalable distributed-memory parallelism, using a virtual shared heap over a hierarchical architecture, with low-overhead shared-memory parallelism on the shared-memory nodes. Key design objectives in realising this system are to prefer local work, and to exploit mostly passive load distribution with pre-fetching. Systematic performance evaluation shows that the automatic hierarchical load distribution policies must be carefully tuned to obtain good performance. We investigate the impact of several policies, including work pre-fetching, favouring inter-node work distribution, and spark segregation with different export and select policies. We present the performance results for GUMSMP, demonstrating good scalability for a set of benchmarks on up to 300 cores. Moreover, our policies provide performance improvements of up to a factor of 1.5 compared to GHC-GUM. The thesis provides a performance evaluation of distributed and shared heap implementations of parallel Haskell on a state-of-the-art physical shared-memory NUMA machine. The evaluation exposes bottlenecks in memory management, which limit scalability beyond 25 cores. We demonstrate that GUMSMP, which combines both distributed and shared heap abstractions, consistently outperforms the shared-memory GHC-SMP on seven benchmarks, by a factor of 3.3 on average. Specifically, we show that the best results are obtained when sharing memory only within a single NUMA region, and using distributed-memory system abstractions across the regions.
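    The key policies named above (prefer local work, passive load distribution, pre-fetching, spark export) can be pictured with a small Haskell sketch. The watermark scheme and all names below are illustrative assumptions, not GUMSMP's actual RTS data structures.

        -- Illustrative sketch only; GUMSMP's real spark pools live in the RTS.
        data Spark = Spark          -- placeholder for a thunk of potential work

        data Pool = Pool
          { localSparks :: [Spark]  -- work kept on this node (preferred)
          , lowWater    :: Int      -- below this, pre-fetch remote work
          , highWater   :: Int      -- above this, offer surplus for export
          }

        -- Passive load distribution: surplus sparks are exported only when a
        -- remote node asks for work and the local pool exceeds its watermark.
        selectForExport :: Pool -> ([Spark], Pool)
        selectForExport p
          | length (localSparks p) > highWater p =
              let (keep, give) = splitAt (highWater p) (localSparks p)
              in (give, p { localSparks = keep })
          | otherwise = ([], p)     -- below the watermark: keep all work local

        -- Pre-fetching: request remote work before the pool runs dry, so the
        -- fetch latency overlaps with the remaining local work.
        needsPrefetch :: Pool -> Bool
        needsPrefetch p = length (localSparks p) < lowWater p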

    Adaptive architecture-transparent policy control in a distributed graph reducer

    The end of the frequency-scaling era arrived around 2005, as clock frequencies stalled for commodity architectures. Performance improvements that could in the past be expected with each new hardware generation therefore had to originate elsewhere. Almost all computer architectures exhibit substantial and growing levels of parallelism, and exploiting that parallelism became one of the key sources of performance and scalability improvements. Alas, parallel programming has proved much more difficult than sequential programming, owing to the need to specify coordination and parallelism-management aspects. Whilst low-level languages place the burden on the programmer, reducing productivity and portability, semi-implicit approaches delegate the responsibility to sophisticated compilers and run-time systems. This thesis presents a study of adaptive load distribution based on work stealing, using history and ancestry information, in a distributed graph reducer for a non-strict functional language. The results contribute to the exploration of more flexible run-time-system-level parallelism control implementing a semi-explicit model of parallelism, which offers productivity and a high level of abstraction by delegating the responsibility for coordination to the run-time system. After characterising a set of parallel functional applications, we study the use of historical information to adapt the choice of victim in a work-stealing scheduler. We observe substantially lower numbers of messages for data-parallel and nested applications. However, this heuristic fails in cases where past application behaviour does not resemble future behaviour, for instance for Divide-&-Conquer applications with a large number of very fine-grained threads and generators of parallelism that move dynamically across processing elements. The mechanism is not specific to our language and run-time system, and applies to other work-stealing schedulers. Next, we focus on the other key work-stealing decision: which sparks, representing potential parallelism, to donate. We investigate the effect of Spark Colocation on the performance of five Divide-&-Conquer programs run on a cluster of up to 256 PEs. When using Spark Colocation, the distributed graph reducer shares related work, resulting in a higher degree of both potential and actual parallelism, and in more fine-grained and less variable thread size. We validate this behaviour by observing a reduction in average fetch times, but increased numbers of FETCH messages and of inter-PE pointers under colocation, which nevertheless results in improved load balance for three of the five benchmark programs. The results show high speedups and speedup improvements with Spark Colocation for the three more regular and nested applications, and performance degradation for two programs: one that is excessively fine-grained and one exhibiting limited scalability. Overall, Spark Colocation appears most beneficial at higher numbers of PEs, where improved load balance and a higher degree of parallelism have more opportunities to pay off. In more general terms, we show that a run-time system can beneficially use both historical information on past stealing successes, gathered dynamically and used within the same run, and ancestry information reconstructed at run time from annotations. Moreover, the results support the view that different heuristics are beneficial for applications using different parallelism patterns, underlining the advantages of a flexible, architecture-transparent approach.
    Funder: The Scottish Informatics and Computer Science Alliance (SICSA).
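    As a rough Haskell sketch of the first heuristic, assuming a simple per-victim success history; the thesis's scheduler is part of the run-time system, so the names and the neutral prior below are illustrative only.

        import qualified Data.Map.Strict as M

        type PE      = Int                  -- processing element identifier
        type History = M.Map PE (Int, Int)  -- per-victim (successful, total) steal attempts

        -- Steer the next steal attempt towards the victim (from a non-empty
        -- candidate list) with the best observed success rate so far.
        chooseVictim :: [PE] -> History -> PE
        chooseVictim victims hist =
          snd (maximum [ (successRate v, v) | v <- victims ])
          where
            successRate :: PE -> Double
            successRate v = case M.lookup v hist of
              Just (ok, total) | total > 0 -> fromIntegral ok / fromIntegral total
              _                            -> 0.5  -- unknown victims: neutral prior

        -- Record the outcome of a steal attempt for use later in the same run.
        recordSteal :: PE -> Bool -> History -> History
        recordSteal v ok = M.insertWith combine v (if ok then 1 else 0, 1)
          where combine (a, b) (c, d) = (a + c, b + d)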

    A parallel functional language compiler for message-passing multicomputers

    The research presented in this thesis concerns the design and implementation of Naira, a parallel, parallelising compiler for a rich, purely functional programming language. The source language of the compiler is a subset of Haskell 1.2. The front end of Naira is written entirely in the Haskell subset being compiled. Naira has been successfully parallelised and is the largest successfully parallelised Haskell program, having achieved good absolute speedups on a network of SUN workstations. Having the same basic structure as other production compilers for functional languages, Naira's parallelisation technology should carry forward to other functional language compilers. The back end of Naira is written in C and generates parallel code in the C language, envisioned to run on distributed-memory machines. The code generator is based on a novel compilation scheme specified using a restricted form of Milner's π-calculus which achieves asynchronous communication. We present the first working implementation of this scheme on distributed-memory message-passing multicomputers with split-phase transactions. Simulated assessment of the generated parallel code indicates good parallel behaviour. Parallelism is introduced using explicit, advisory user annotations in the source program, and there are two major aspects of the use of annotations in the compiler. First, the front end of the compiler is itself parallelised, so as to improve its efficiency when compiling input programs. Secondly, the input programs to the compiler can themselves contain annotations, based on which the compiler generates the multi-threaded parallel code. These make Naira, unusually and uniquely, both a parallel and a parallelising compiler. We adopt a medium-grained approach to granularity, where function applications form the unit of parallelism and load distribution. We have experimented with two different task distribution strategies, deterministic and random, and with thread-based and quantum-based scheduling policies. Our experiments show that there is little efficiency difference for regular programs, but that the quantum-based scheduler is best for programs with irregular parallelism. The compiler has been successfully built, parallelised and assessed using both idealised and realistic measurement tools: we obtained significant compilation speed-ups on a variety of simulated parallel architectures. The simulated results are supported by the best results obtained on real hardware for such a large program: we measured an absolute speedup of 2.5 on a network of 5 SUN workstations. The compiler has also been shown to have good parallelising potential, based on popular test programs. Results of assessing Naira's generated, unoptimised parallel code are comparable to those produced by other successful parallel implementation projects.
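    Naira's advisory annotations are its own notation, but the same style survives in today's GHC as the par and pseq combinators: a marked function application may be evaluated in parallel, and the run-time system is free to ignore the hint. A small approximation in modern Haskell, not Naira syntax:

        import Control.Parallel (par, pseq)

        -- Each recursive application is advised, not forced, to run in
        -- parallel, mirroring the advisory character of Naira's annotations;
        -- the function application is the unit of parallelism.
        parFib :: Int -> Int
        parFib n
          | n < 2     = n
          | otherwise = left `par` (right `pseq` (left + right))
          where
            left  = parFib (n - 1)
            right = parFib (n - 2)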

    Granularity in Large-Scale Parallel Functional Programming

    This thesis demonstrates how to reduce the runtime of large non-strict functional programs using parallel evaluation. The parallelisation of several programs shows the importance of granularity, i.e. the computation costs of program expressions. The aspect of granularity is studied both on a practical level, by presenting and measuring runtime granularity-improvement mechanisms, and at a more formal level, by devising a static granularity analysis. By parallelising several large functional programs this thesis demonstrates for the first time the advantages of combining lazy and parallel evaluation on a large scale: laziness aids modularity, while parallelism reduces runtime. One of the parallel programs is the Lolita system which, with more than 47,000 lines of code, is the largest existing parallel non-strict functional program. A new mechanism for parallel programming, evaluation strategies, to which this thesis contributes, is shown to be useful in this parallelisation. Evaluation strategies simplify parallel programming by separating algorithmic code from code specifying dynamic behaviour. For large programs the abstraction provided by functions is maintained by using a data-oriented style of parallelism, which defines parallelism over intermediate data structures rather than inside the functions. A highly parameterised simulator, GRANSIM, has been constructed collaboratively and is discussed in detail in this thesis. GRANSIM is a tool for architecture-independent parallelisation and a testbed for implementing runtime-system features of the parallel graph reduction model. By providing an idealised as well as an accurate model of the underlying parallel machine, GRANSIM has proven to be an essential part of an integrated parallel software engineering environment. Several parallel runtime-system features, such as granularity-improvement mechanisms, have been tested via GRANSIM. It is publicly available and in active use at several universities worldwide. In order to provide granularity information, this thesis presents an inference-based static granularity analysis. This analysis combines two existing analyses, one for cost and one for size information. It determines an upper bound on the computation costs of evaluating an expression in a simple strict higher-order language. By exposing recurrences during cost reconstruction, and using a library of recurrences and their closed forms, it is possible to infer the costs of some recursive functions. The possible performance improvements are assessed by measuring the parallel performance of a hand-analysed and annotated program.
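    Evaluation strategies, as described above, separate what is computed from how it is evaluated in parallel. A minimal example using the modern Control.Parallel.Strategies API, a descendant of the strategy formulation this thesis contributed to:

        import Control.Parallel.Strategies (using, parList, rdeepseq)

        -- Algorithmic code: an ordinary map-and-sum over the input.
        sumOfSquares :: [Int] -> Int
        sumOfSquares xs = sum squares
          where
            -- Dynamic behaviour, attached separately with 'using': evaluate
            -- the list elements in parallel, each to normal form, without
            -- changing the algorithm itself.
            squares = map (^ 2) xs `using` parList rdeepseq

    Granularity matters here too: for elements as cheap as these, a chunking strategy such as parListChunk would coarsen the sparks, which is exactly the kind of granularity improvement the thesis measures.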