
    Performance driven distributed scheduling of parallel hybrid computations

    Exascale computing is fast becoming a mainstream research area. In order to realize exascale performance, it is necessary to schedule large parallel computations efficiently, with scalable performance on a large number of cores/processors. The scheduler needs to execute in a purely distributed and online fashion, should follow the affinity inherent in the computation, and must have low time and message complexity. Further, it should also avoid physical deadlocks due to bounded resources, including space/memory per core. Simultaneous consideration of these factors makes affinity-driven distributed scheduling particularly challenging. We attempt to address this challenge for hybrid parallel computations, which contain both tasks that have pre-specified affinity to a place and tasks that can be mapped to any place in the system. Specifically, we address two scheduling problems of the type Pm|Mj,prec|Cmax. This paper presents online distributed scheduling algorithms for hybrid parallel computations assuming both unconstrained and bounded space per place. We also present the time and message complexity for distributed scheduling of hybrid computations. To the best of our knowledge, this is the first time that distributed scheduling algorithms for hybrid parallel computations have been presented and analyzed for time and message bounds under both unconstrained and bounded space.
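    The affinity/anyplace split described above can be pictured with a small sketch. The following is a minimal illustration, not the paper's algorithm: hypothetical Task and Scheduler classes in which place-affine tasks go to their designated place's queue, while anyplace tasks go wherever the load is currently lowest.

```python
# Minimal sketch (not the paper's algorithm): an affinity-driven
# scheduler where tasks with a pre-specified place go to that place's
# queue, while "anyplace" tasks go to the least-loaded place.
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    name: str
    affinity: Optional[int] = None  # fixed place, or None for anyplace

class Scheduler:
    def __init__(self, num_places: int):
        self.queues = [deque() for _ in range(num_places)]

    def submit(self, task: Task) -> None:
        if task.affinity is not None:
            self.queues[task.affinity].append(task)  # honour affinity
        else:
            # anyplace task: balance load by picking the shortest queue
            target = min(range(len(self.queues)),
                         key=lambda p: len(self.queues[p]))
            self.queues[target].append(task)

    def run(self) -> None:
        for place, q in enumerate(self.queues):
            while q:
                task = q.popleft()
                print(f"place {place} executes {task.name}")

sched = Scheduler(num_places=2)
sched.submit(Task("t0", affinity=1))
sched.submit(Task("t1"))  # anyplace
sched.run()
```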

    Static Analysis of OpenStream Programs

    This paper studies the applicability of polyhedral techniques to the parallel language OpenStream [25]. When applicable, polyhedral techniques are invaluable for compile-time debugging and for generating efficient code well suited to a target architecture. OpenStream is a two-level language in which a control program directs the initialization of parallel task instances that communicate through streams, with possibly multiple writers and readers. It has a fairly complex semantics in its most general setting, but we restrict ourselves to the case where the control program is sequential, which is representative of the majority of OpenStream applications. This restriction offers deterministic concurrency by construction, but deadlocks are still possible. We show that, if the control program is polyhedral, one may statically compute, for each task instance, the read and write indices into each of its streams, and thus reason statically about the dependences among task instances (the only scheduling constraints in this polyhedral subset). These indices may be polynomials of arbitrary degree, which requires extending the standard polyhedral techniques for dependence analysis, scheduling, and deadlock detection to polynomials. Modern SMT solvers can solve such polynomial problems, albeit with no guarantee of success; the approach of Feautrier [10] may offer an alternative solution. We also establish two important results related to deadlocks in OpenStream: 1) a characterization of deadlocks in terms of dependence paths, which implies that streams can be safely bounded as soon as a schedule exists with such sizes; 2) a proof that deadlock detection is undecidable, even for polyhedral OpenStream.
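    As a toy illustration of the kind of reasoning involved (with hypothetical polynomial index functions, not OpenStream's actual analysis), one can enumerate a bounded domain to find the dependences induced by matching read and write indices; the paper instead solves such polynomial constraints symbolically, for example with SMT solvers.

```python
# Toy illustration (hypothetical index functions, not OpenStream's
# actual analysis): if writer instance i writes stream index w(i) and
# reader instance j reads index r(j), a dependence j -> i exists
# whenever r(j) == w(i). The paper solves such polynomial constraints
# symbolically; here we simply enumerate a bounded domain.
def w(i):           # write index of writer instance i (a polynomial)
    return i * i

def r(j):           # read index of reader instance j (a polynomial)
    return 2 * j + 1

deps = [(i, j) for i in range(20) for j in range(20) if w(i) == r(j)]
print(deps)  # reader j depends on writer i for each matching pair
```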

    Scalable data abstractions for distributed parallel computations

    The ability to express a program as a hierarchical composition of parts is an essential tool for managing the complexity of software, and a key abstraction this provides is the separation of the representation of data from the computation. Many current parallel programming models use a shared memory model to provide data abstraction, but this does not scale well to large numbers of cores due to non-determinism and access latency. This paper proposes a simple programming model that allows scalable parallel programs to be expressed with distributed representations of data, and it gives the programmer the flexibility to employ shared or distributed styles of data-parallelism where applicable. It is capable of an efficient implementation, and with the provision of a small set of primitive capabilities in the hardware, it can be compiled to operate directly on the hardware, in the same way that stack-based allocation operates for subroutines in sequential machines.
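    A minimal sketch of what a distributed representation of data can look like, assuming a hypothetical block-partitioned array (illustrative only, not the paper's model): elements live on different "places", while the programmer keeps a global-index view.

```python
# Minimal sketch of a distributed data representation (illustrative
# only, not the paper's model): a block-partitioned array whose
# elements live on different "places", accessed via global indices.
class DistArray:
    def __init__(self, n, num_places):
        self.block = (n + num_places - 1) // num_places
        # each place holds one contiguous block of the array
        self.parts = [[0] * min(self.block, n - p * self.block)
                      for p in range(num_places)]

    def locate(self, i):
        return i // self.block, i % self.block  # (place, local index)

    def __getitem__(self, i):
        p, off = self.locate(i)
        return self.parts[p][off]

    def __setitem__(self, i, v):
        p, off = self.locate(i)
        self.parts[p][off] = v

a = DistArray(10, num_places=3)
a[7] = 42
print(a.locate(7), a[7])  # element 7 lives on place 1
```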

    Bounded Stream Scheduling in Polyhedral OpenStream

    We consider OpenStream, a streaming dataflow language which supports the specification of concurrent tasks that communicate through streams. Streams, in the spirit of classical process networks, have no restrictions on their size. In order to deploy an OpenStream program on a chip, however, the size of the streams has to be bounded. This constrains the range of runtime behaviors by restricting the schedules to the subset of parallel executions in which the required memory never surpasses the available resources. In this paper we exploit an approach that conservatively certifies that augmenting the intrinsic dataflow dependences of the program with stream-bounding constraints does not deadlock the program: it cannot show the existence of a deadlock, but it can give a certificate for the absence thereof. The aim of this work is to study the limitations of this stream-bounding strategy and to demonstrate how it can currently be used to determine whether an OpenStream program can execute under the particular memory constraints of a given architecture.
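    The intuition behind certifying a stream bound can be sketched as follows (a simplification for illustration, not the paper's actual algorithm): bounding a stream to k slots adds "back-pressure" dependences from each write to the read that frees its slot, and if the dependence graph augmented with these edges is acyclic, a valid schedule exists under that bound.

```python
# Illustrative certificate check (a simplification of the idea, not
# the paper's algorithm): model writer/reader steps as graph nodes,
# add program-order, data, and back-pressure edges, and certify a
# stream bound k as safe when the resulting graph is acyclic.
import graphlib  # stdlib topological sorter, Python 3.9+

def bound_is_safe(writes, reads, k):
    """writes[t]/reads[t]: stream index accessed at step t; k: bound."""
    g = {}
    def dep(a, b):
        g.setdefault(a, set()).add(b)          # a waits for b
    for t in range(1, len(writes)):            # program order
        dep(("W", t), ("W", t - 1))
    for t in range(1, len(reads)):
        dep(("R", t), ("R", t - 1))
    for t, idx in enumerate(reads):            # data dependences
        dep(("R", t), ("W", writes.index(idx)))
    for t in range(k, len(writes)):            # back-pressure: a slot
        dep(("W", t), ("R", t - k))            # frees after read t-k
    try:
        list(graphlib.TopologicalSorter(g).static_order())
        return True   # acyclic: bound k cannot deadlock
    except graphlib.CycleError:
        return False  # cycle: bound k may deadlock

print(bound_is_safe([0, 1, 2], [0, 1, 2], k=1))  # True: in-order access
print(bound_is_safe([0, 1, 2], [2, 0, 1], k=1))  # False: reader skips ahead
```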

    Compiling SHIM

    Embedded systems demand concurrency to support simultaneous actions in their environment and parallel hardware. Although most concurrent programming formalisms are prone to races and non-determinism, some, such as our SHIM (software/hardware integration medium) language, avoid them by design. In particular, the behavior of SHIM programs is scheduling-independent, meaning the I/O behavior of a program is independent of scheduling policies, including the relative execution rates of concurrent processes. The SHIM project demonstrates how a scheduling-independent language simplifies the design, optimization, and verification of concurrent systems. Through examples and discussion, we describe the SHIM language and code generation techniques for both shared-memory and message-passing architectures, along with some verification algorithms.
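    SHIM's scheduling independence rests on rendezvous-style communication. A rough CSP-style emulation in Python (an assumption for illustration, not SHIM's actual runtime or code generator) shows why: with an unbuffered channel, the observable output is the same under every thread interleaving.

```python
# Rough CSP-style emulation (not SHIM itself): an unbuffered channel
# where send blocks until a matching recv, making the observable
# output independent of the scheduler's interleaving.
import threading, queue

class Rendezvous:
    def __init__(self):
        self._item = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, v):
        self._item.put(v)
        self._ack.get()      # wait until the receiver has taken it

    def recv(self):
        v = self._item.get()
        self._ack.put(None)  # release the sender
        return v

def producer(ch):
    for i in range(5):
        ch.send(i * i)

def consumer(ch, out):
    for _ in range(5):
        out.append(ch.recv())

ch, out = Rendezvous(), []
t1 = threading.Thread(target=producer, args=(ch,))
t2 = threading.Thread(target=consumer, args=(ch, out))
t1.start(); t2.start(); t1.join(); t2.join()
print(out)  # always [0, 1, 4, 9, 16], whatever the interleaving
```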

    Function Shipping in a Scalable Parallel Programming Model

    Increasingly, a large number of scientific and technical applications exhibit dynamically generated parallelism or irregular data access patterns. These applications pose significant challenges to achieving scalable performance on large-scale parallel systems. This thesis explores the advantages of using function shipping as a language-level primitive to help simplify writing scalable irregular and dynamic parallel applications. Function shipping provides a mechanism to avoid exposing latency by enabling users to ship data and computation together to a remote worker for execution. In the context of the Coarray Fortran 2.0 Partitioned Global Address Space language, we implement function shipping and the finish synchronization construct, which ensures global completion of a set of shipped function instances. We demonstrate the usability and performance benefits of function shipping with several benchmarks. Experiments on emerging supercomputers show that function shipping is useful and effective in achieving scalable performance with dynamic and irregular algorithms.
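    A rough Python analogy (not Coarray Fortran 2.0 itself) conveys the idea: work and its data are shipped together to a pool of workers standing in for remote places, and a finish-like step waits for global completion of all shipped instances.

```python
# Rough analogy (not Coarray Fortran 2.0): "shipping" a function and
# its data to workers that stand in for remote places, with a
# finish-like wait for global completion of all shipped instances.
from concurrent.futures import ProcessPoolExecutor, wait

def update(chunk):  # the computation shipped along with its data
    return [x * x for x in chunk]

if __name__ == "__main__":
    data = [list(range(i, i + 4)) for i in range(0, 16, 4)]
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(update, c) for c in data]  # ship work
        wait(futures)  # "finish": wait for global completion
        print([f.result() for f in futures])
```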

    X10 vs Java: Concurrency Constructs and Performance

    To avoid overheating the chip, chip designers have switched to multi-core designs. While multicore CPUs retain the instruction-level parallelism features that let existing applications run as if they were on a single core, those applications do not automatically run two or four times faster. Instead of relying on the compiler and hardware to discover parallelism in source code, software developers now must control parallelism explicitly in their programs. Many programming languages and libraries, such as Java, C# .NET, and OpenMP, try to help programmers by providing rich concurrency APIs. X10 is an experimental language from IBM Research, under development since 2004, targeting multi-core programming from single multi-core machines to clusters. This project examines the X10 parallel constructs, compares their usability with that of the Java language and the OpenMP library, and then compares the performance of X10 and Java.
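    The flavor of X10's async/finish constructs can be emulated with threads; the sketch below is a hypothetical Python rendering for illustration, not real X10 syntax (which reads `finish { async S; }`).

```python
# Hypothetical Python emulation of X10-style async/finish
# (illustration only; real X10 syntax is `finish { async S; }`).
import threading

class Finish:
    def __init__(self):
        self.threads = []

    def async_(self, fn, *args):  # X10: async S
        t = threading.Thread(target=fn, args=args)
        self.threads.append(t)
        t.start()

    def __enter__(self):
        return self

    def __exit__(self, *exc):     # end of finish block: join all tasks
        for t in self.threads:
            t.join()
        return False

def work(i):
    print("task", i)

with Finish() as f:               # finish { for i { async work(i); } }
    for i in range(4):
        f.async_(work, i)
print("all spawned tasks completed")
```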

    WSCOM: Online Task Scheduling with Data Transfers

    This paper considers the online problem of task scheduling with communication. No information on tasks and communications is available in advance, except the DAG of the task topology. This situation is typically encountered when scheduling the DAGs of tasks corresponding to Makefile executions. To tackle this problem, we introduce WSCOM, a new family of variations of the work-stealing algorithm. These algorithms take advantage of knowledge of the DAG topology to cluster communicating tasks together and reduce the total number of communications. Several variants are designed to overlap communication or optimize the graph decomposition. Performance is evaluated by simulation, and our algorithms are compared with off-line list-scheduling algorithms and with classical work-stealing from the literature. Simulations are executed on both random graphs and a new trace archive of Makefile DAGs. These experiments validate the different design choices taken. In particular, we show that WSCOM is able to achieve performance close to that of off-line algorithms in most cases, and is even able to achieve better performance in the event of congestion, thanks to reduced data transfer. Moreover, WSCOM can achieve the same high performance as classical work-stealing with up to ten times less bandwidth.
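    A simplified sketch of the underlying idea (not the paper's exact WSCOM algorithm): each worker pushes a finished task's ready successors onto its own deque, so communicating parent/child pairs tend to stay on one worker, while idle workers steal the oldest tasks from others.

```python
# Simplified sketch of the WSCOM idea (not the paper's exact
# algorithm): workers push a finished task's ready DAG successors
# onto their own deque, keeping communicating tasks together, while
# idle workers steal the oldest task from a busy victim.
import collections, random

def work_steal(dag, roots, num_workers=2, seed=0):
    rng = random.Random(seed)
    indeg = collections.Counter(v for vs in dag.values() for v in vs)
    deques = [collections.deque() for _ in range(num_workers)]
    for i, r in enumerate(roots):
        deques[i % num_workers].append(r)
    done = []
    while any(deques):
        for w, dq in enumerate(deques):
            if not dq:                           # idle: try to steal
                victims = [d for d in deques if len(d) > 1]
                if victims:
                    dq.append(rng.choice(victims).popleft())  # steal oldest
                continue
            task = dq.pop()                      # run newest local task
            done.append((w, task))
            for child in dag.get(task, []):      # ready children stay
                indeg[child] -= 1                # local, so parent/child
                if indeg[child] == 0:            # communication is cheap
                    dq.append(child)
    return done

dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(work_steal(dag, roots=["a"]))  # (worker, task) execution order
```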