77 research outputs found

    Automatic skeleton-driven performance optimizations for transactional memory

    Get PDF
    The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more coarse -grain parallelism with every new generation of processors, or the performance of their applications will remain roughly the same or even degrade. Unfortunately, parallel programming is still hard and error prone. This has driven the development of many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the problem at hand. Thus, this programming approach simplifies parallel programming by eliminating some of the major challenges of parallel programming, namely thread communication, scheduling and orchestration. However, the application programmer has still to correctly synchronize threads on data races. This commonly requires the use of locks to guarantee atomic access to shared data. In particular, lock programming is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly. This model allows programmers to write parallel code as transactions, which are then guaranteed by the runtime system to execute atomically and in isolation regardless of eventual data races. TM programming thus frees the application from deadlocks and enables the exploitation of coarse grain parallelism when transactions do not conflict very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework. This fact makes the combination of skeleton -based and transactional programming a natural step to improve programmability since these models complement each other. In fact, this combination releases the application programmer from dealing with thread management and data races, and also inherits the performance improvements of both models. In addition to it, a skeleton framework is also amenable to skeleton - driven iii performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the knowledge about the worklist pattern and the TM nature of the applications to avoid transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly. Experimental results on a subset of five applications from the STAMP benchmark suite show that the proposed autotuning mechanism can achieve performance improvements within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance optimizations such as thread mapping and memory page allocation. In order to do it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five applications from the STAMP benchmark show that the OpenSkel framework with the extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %, over a baseline version for a 16 -core UMA platform and up to 162 %, with an average of 91 %, for a 32 -core NUMA platform

    Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems

    Get PDF
    Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of parallel programming models that aims at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming. The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce a benchmark suite, RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends. The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM desings, do not always meet the programmer's expectation, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects application and auxiliary threads' computing demands and dynamically partition hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines. The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if'' serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, transparent to the underlying TM implementation, can be used for a broad set of C/C++ TM programs. Based on this new definition, we proposed T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past

    Data-centric concurrency control on the java programming language

    Get PDF
    Dissertação para obtenção do Grau de Mestre em Engenharia InformáticaThe multi-core paradigm has propelled shared-memory concurrent programming to an important role in software development. Its use is however limited by the constructs that provide a layer of abstraction for synchronizing access to shared resources. Reasoning with these constructs is not trivial due to their concurrent nature. Data-races and deadlocks occur in concurrent programs, encumbering the programmer and further reducing his productivity. Even though the constructs should be as unobtrusive and intuitive as possible, performance must also be kept high compared to legacy lock-based mechanism. Failure to guarantee similar performance will hinder a system from adoption. Recent research attempts to address these issues. However, the current state of the art in concurrency control mechanisms is mostly code-centric and not intuitive. Its codecentric nature requires the specification of the zones in the code that require synchronization,contributing to the decentralization of concurrency bugs and error-proneness of the programmer. On the other hand, the only data-centric approach, AJ [VTD06], exposes excessive detail to the programmer and fails to provide complete deadlock-freedom. Given this state of the art, our proposal intends to provide the programmer a set of unobtrusive data-centric constructs. These will guarantee desirable security properties: composability, atomicity, and deadlock-freedom in all scenarios. For that purpose, a lower level mechanism (ResourceGroups) will be used. The model proposed resides on the known concept of atomic variables, the basis for our concurrency control mechanism. To infer the efficiency of our work, it is compared to Java synchronized blocks, transactional memory and AJ, where our system demonstrates a competitive performance and an equivalent level of expressivity.RepComp project(PTDC/EIA-EIA/108963/2008

    Efficient runtime systems for speculative parallelization

    Get PDF
    Manuelle Parallelisierung ist zeitaufwändig und fehleranfällig. Automatische Parallelisierung andererseits findet häufig nur einen Bruchteil der verfügbaren Parallelität. Mithilfe von Spekulation kann jedoch auch für komplexere Programme ein Großteil der Parallelität ausgenutzt werden. Spekulativ parallelisierte Programme benötigen zur Ausführung immer ein Laufzeitsystem, um die spekulativen Annahmen abzusichern und für den Fall des Nichtzutreffens die korrekte Ausführungssemantik sicherzustellen. Solche Laufzeitsysteme sollen die Ausführungszeit des parallelen Programms so wenig wie möglich beeinflussen. In dieser Arbeit untersuchen wir, inwiefern aktuelle Systeme, die Speicherzugriffe explizit und in Software beobachten, diese Anforderung erfüllen, und stellen Änderungen vor, die die Laufzeit massiv verbessern. Außerdem entwerfen wir zwei neue Systeme, die mithilfe von virtueller Speicherverwaltung das Programm indirekt beobachten und dadurch eine deutlich geringere Auswirkung auf die Laufzeit haben. Eines der vorgestellten Systeme ist mittels eines Moduls direkt in den Linux-Betriebssystemkern integriert und bietet so die bestmögliche Effizienz. Darüber hinaus bietet es weitreichendere Sicherheitsgarantien als alle bisherigen Techniken, indem sogar Systemaufrufe zum Beispiel zur Datei Ein- und Ausgabe in der spekulativen Isolation mit eingeschlossen sind. Wir zeigen an einer Reihe von Benchmarks die Überlegenheit unserer Spekulationssyteme über den derzeitigen Stand der Technik. Sämtliche unserer Erweiterungen und Neuentwicklungen stehen als open source zur freien Verfügung. Diese Arbeit ist in englischer Sprache verfasst.Manual parallelization is time consuming and error-prone. Automatic parallelization on the other hand is often unable to extract substantial parallelism. Using speculation, however, most of the parallelism can be exploited even of complex programs. Speculatively parallelized programs always need a runtime system during execution in order to ensure the validity of the speculative assumptions, and to ensure the correct semantics even in the case of misspeculation. These runtime systems should influence the execution time of the parallel program as little as possible. In this thesis, we investigate to which extend state-of-the-art systems which track memory accesses explicitly in software fulfill this requirement. We describe and implement changes which improve their performance substantially. We also design two new systems utilizing virtual memory abstraction to track memory changed implicitly, thus causing less overhead during execution. One of the new systems is integrated into the Linux kernel as a kernel module, providing the best possible performance. Furthermore it provides stronger soundness guarantees than any state-of-the-art system by also capturing system calls, hence including for example file I/O into speculative isolation. In a number of benchmarks we show the performance improvements of our virtual memory based systems over the state of the art. All our extensions and newly developed speculation systems are made available as open source

    How functional programming mattered

    Get PDF
    In 1989 when functional programming was still considered a niche topic, Hughes wrote a visionary paper arguing convincingly ‘why functional programming matters’. More than two decades have passed. Has functional programming really mattered? Our answer is a resounding ‘Yes!’. Functional programming is now at the forefront of a new generation of programming technologies, and enjoying increasing popularity and influence. In this paper, we review the impact of functional programming, focusing on how it has changed the way we may construct programs, the way we may verify programs, and fundamentally the way we may think about programs

    A speculative execution approach to provide semantically aware contention management for concurrent systems

    Get PDF
    PhD ThesisMost modern platforms offer ample potention for parallel execution of concurrent programs yet concurrency control is required to exploit parallelism while maintaining program correctness. Pessimistic con- currency control featuring blocking synchronization and mutual ex- clusion, has given way to transactional memory, which allows the composition of concurrent code in a manner more intuitive for the application programmer. An important component in any transactional memory technique however is the policy for resolving conflicts on shared data, commonly referred to as the contention management policy. In this thesis, a Universal Construction is described which provides contention management for software transactional memory. The technique differs from existing approaches given that multiple execution paths are explored speculatively and in parallel. In the resolution of conflicts by state space exploration, we demonstrate that both concur- rent conflicts and semantic conflicts can be solved, promoting multi- threaded program progression. We de ne a model of computation called Many Systems, which defines the execution of concurrent threads as a state space management problem. An implementation is then presented based on concepts from the model, and we extend the implementation to incorporate nested transactions. Results are provided which compare the performance of our approach with an established contention management policy, under varying degrees of concurrent and semantic conflicts. Finally, we provide performance results from a number of search strategies, when nested transactions are introduced

    New hardware support transactional memory and parallel debugging in multicore processors

    Get PDF
    This thesis contributes to the area of hardware support for parallel programming by introducing new hardware elements in multicore processors, with the aim of improving the performance and optimize new tools, abstractions and applications related with parallel programming, such as transactional memory and data race detectors. Specifically, we configure a hardware transactional memory system with signatures as part of the hardware support, and we develop a new hardware filter for reducing the signature size. We also develop the first hardware asymmetric data race detector (which is also able to tolerate them), based also in hardware signatures. Finally, we propose a new module of hardware signatures that solves some of the problems that we found in the previous tools related with the lack of flexibility in hardware signatures
    • …
    corecore