115,518 research outputs found

    Scheduling Macro-DataFlow Programs on Task-Parallel Runtime Systems

    Get PDF
    Though multicore systems are ubiquitous, parallel programming models for these systems are generally not accessible to a wide programmer community. The macro-dataflow model is an attractive stepping stone to implicit parallelism for domain experts who are not the target audience for explicit parallel programming models. We use Intel's Concurrent Collections (CnC) programming model as a concrete exemplar of the macro-dataflow model in this work. CnC is a high level coordination language that can be implemented on top of lower-level task-parallel frameworks. In this thesis, we study an implementation of CnC, based on Habanero-Java as the underlying task-parallel runtime system. A unique feature of CnC, first-class decoupling of data and control dependences, allows us to experiment with schedulers by taking these data and control dependences into account for better scheduling decisions. Our observations led to the proposal and implementation of a new task-parallel synchronization construct for Habanero-Java, namely Data-Driven Futures. We obtained two kinds of experimental results from our implementation. First, we compare the effectiveness of task scheduling policies for CnC programs. Secondly, we show that data-driven futures not only reduce execution time but also shrink memory footprint. In summary, this thesis shows a macro-dataflow programming model can deliver productivity and performance on modern multicore processors

    Analysis, classification and comparison of scheduling techniques for software transactional memories

    Get PDF
    Transactional Memory (TM) is a practical programming paradigm for developing concurrent applications. Performance is a critical factor for TM implementations, and various studies demonstrated that specialised transaction/thread scheduling support is essential for implementing performance-effective TM systems. After one decade of research, this article reviews the wide variety of scheduling techniques proposed for Software Transactional Memories. Based on peculiarities and differences of the adopted scheduling strategies, we propose a classification of the existing techniques, and we discuss the specific characteristics of each technique. Also, we analyse the results of previous evaluation and comparison studies, and we present the results of a new experimental study encompassing techniques based on different scheduling strategies. Finally, we identify potential strengths and weaknesses of the different techniques, as well as the issues that require to be further investigated

    Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM

    Full text link
    Phase change memory (PCM) has recently emerged as a promising technology to meet the fast growing demand for large capacity memory in computer systems, replacing DRAM that is impeded by physical limitations. Multi-level cell (MLC) PCM offers high density with low per-byte fabrication cost. However, despite many advantages, such as scalability and low leakage, the energy for programming intermediate states is considerably larger than programing single-level cell PCM. In this paper, we study encoding techniques to reduce write energy for MLC PCM when the encoding granularity is lowered below the typical cache line size. We observe that encoding data blocks at small granularity to reduce write energy actually increases the write energy because of the auxiliary encoding bits. We mitigate this adverse effect by 1) designing suitable codeword mappings that use fewer auxiliary bits and 2) proposing a new Word-Level Compression (WLC) which compresses more than 91% of the memory lines and provides enough room to store the auxiliary data using a novel restricted coset encoding applied at small data block granularities. Experimental results show that the proposed encoding at 16-bit data granularity reduces the write energy by 39%, on average, versus the leading encoding approach for write energy reduction. Furthermore, it improves endurance by 20% and is more reliable than the leading approach. Hardware synthesis evaluation shows that the proposed encoding can be implemented on-chip with only a nominal area overhead.Comment: 12 page

    Porting Decision Tree Algorithms to Multicore using FastFlow

    Full text link
    The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an approach for easy-yet-efficient porting of an implementation of the C4.5 algorithm on multicores. The parallel porting requires minimal changes to the original sequential code, and it is able to exploit up to 7X speedup on an Intel dual-quad core machine.Comment: 18 pages + cove

    Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

    Get PDF
    Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses to larger remote access operations. A straightforward implementation of the inspector executor transformation results in excessive instrumentation that hinders performance.; This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs) [S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, 2003.], the inlining of data locality checks and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body.; A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase performance of UPC programs up to 1.8 x their UPC hand-optimized counterpart for applications with regular accesses and up to 6.3 x for applications with irregular accesses.Peer ReviewedPostprint (author's final draft

    A Map-Reduce Parallel Approach to Automatic Synthesis of Control Software

    Full text link
    Many Control Systems are indeed Software Based Control Systems, i.e. control systems whose controller consists of control software running on a microcontroller device. This motivates investigation on Formal Model Based Design approaches for automatic synthesis of control software. Available algorithms and tools (e.g., QKS) may require weeks or even months of computation to synthesize control software for large-size systems. This motivates search for parallel algorithms for control software synthesis. In this paper, we present a Map-Reduce style parallel algorithm for control software synthesis when the controlled system (plant) is modeled as discrete time linear hybrid system. Furthermore we present an MPI-based implementation PQKS of our algorithm. To the best of our knowledge, this is the first parallel approach for control software synthesis. We experimentally show effectiveness of PQKS on two classical control synthesis problems: the inverted pendulum and the multi-input buck DC/DC converter. Experiments show that PQKS efficiency is above 65%. As an example, PQKS requires about 16 hours to complete the synthesis of control software for the pendulum on a cluster with 60 processors, instead of the 25 days needed by the sequential algorithm in QKS.Comment: To be submitted to TACAS 2013. arXiv admin note: substantial text overlap with arXiv:1207.4474, arXiv:1207.409

    RELEASE: A High-level Paradigm for Reliable Large-scale Server Software

    Get PDF
    Erlang is a functional language with a much-emulated model for building reliable distributed systems. This paper outlines the RELEASE project, and describes the progress in the rst six months. The project aim is to scale the Erlang's radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. Currently Erlang has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. We are working at three levels to address these challenges: evolving the Erlang virtual machine so that it can work effectively on large scale multicore systems; evolving the language to Scalable Distributed (SD) Erlang; developing a scalable Erlang infrastructure to integrate multiple, heterogeneous clusters. We are also developing state of the art tools that allow programmers to understand the behaviour of massively parallel SD Erlang programs. We will demonstrate the e ectiveness of the RELEASE approach using demonstrators and two large case studies on a Blue Gene
    corecore