133 research outputs found
Aikido: Accelerating shared data dynamic analyses
Despite a burgeoning demand for parallel programs, the tools available to developers working on shared-memory multicore processors have lagged behind. One reason for this is the lack of hardware support for inspecting the complex behavior of these parallel programs. Inter-thread communication, which must be instrumented for many types of analyses, may occur with any memory operation. To detect such thread communication in software, many existing tools require the instrumentation of all memory operations, which leads to significant performance overheads. To reduce this overhead, some existing tools resort to random sampling of memory operations, which introduces false negatives. Unfortunately, neither of these approaches provide the speed and accuracy programmers have traditionally expected from their tools. In this work, we present Aikido, a new system and framework that enables the development of efficient and transparent analyses that operate on shared data. Aikido uses a hybrid of existing hardware features and dynamic binary rewriting to detect thread communication with low overhead. Aikido runs a custom hypervisor below the operating system, which exposes per-thread hardware protection mechanisms not available in any widely used operating system. This hybrid approach allows us to benefit from the low cost of detecting memory accesses with hardware, while maintaining the word-level accuracy of a software-only approach. To evaluate our framework, we have implemented an Aikido-enabled vector clock race detector. Our results show that the Aikido enabled race-detector outperforms existing techniques that provide similar accuracy by up to 6.0x, and 76% on average, on the PARSEC benchmark suite.National Science Foundation (U.S.) (NSF grant CCF-0832997)National Science Foundation (U.S.) (DOE SC0005288)United States. Defense Advanced Research Projects Agency (DARPA HR0011-10- 9-0009
Recommended from our members
Software lock elision for x86 machine code
More than a decade after becoming a topic of intense research there is no
transactional memory hardware nor any examples of software transactional memory
use outside the research community. Using software transactional memory in large
pieces of software needs copious source code annotations and often means
that standard compilers and debuggers can no longer be used. At the same time,
overheads associated with software transactional memory fail to motivate
programmers to expend the needed effort to use software transactional
memory. The only way around the overheads in the case of general unmanaged code
is the anticipated availability of hardware support. On the other hand, architects
are unwilling to devote power and area budgets in mainstream microprocessors to
hardware transactional memory, pointing to transactional memory being a
"niche" programming construct. A deadlock has thus ensued that is blocking
transactional memory use and experimentation in the mainstream.
This dissertation covers the design and construction of a software transactional
memory runtime system called SLE_x86 that can potentially break this
deadlock by decoupling transactional memory from programs using it. Unlike most
other STM designs, the core design principle is transparency rather than
performance. SLE_x86 operates at the level of x86 machine code, thereby
becoming immediately applicable to binaries for the popular x86
architecture. The only requirement is that the binary synchronise using known
locking constructs or calls such as those in Pthreads or OpenMP
libraries. SLE_x86 provides speculative lock elision (SLE) entirely in
software, executing critical sections in the binary using transactional
memory. Optionally, the critical sections can also be executed without using
transactions by acquiring the protecting lock.
The dissertation makes a careful analysis of the impact on performance due to
the demands of the x86 memory consistency model and the need to transparently
instrument x86 machine code. It shows that both of these problems can be
overcome to reach a reasonable level of performance, where transparent
software transactional memory can perform better than a lock. SLE_x86 can
ensure that programs are ready for transactional memory in any form, without
being explicitly written for it
Recommended from our members
Providing Easy to Use and Fast Programming Support for Non-Volatile Memories
Non-Volatile Memory (NVM) technologies, such as 3D XPoint, offer DRAM-like performance and byte-addressable access to persistent data. NVMs promise an opportunity for fast, persistent data structures, and a wide range of applications stand to benefit from the performance potential of these technologies. These potential benefits are greatest when applications access NVM directly via load/store instructions rather than conventional file-based interfaces. Directly accessing NVM presents several challenges. In particular, applications need guaranteed consistency and safety semantics to protect their data structures in the face of system failures and programming errors.Implementing data structures that meet these requirements is challenging and error-prone. Existing methods for building persistent data structures require either in-depth code changes to an existing data structure or rewriting the data structure from scratch. Unfortunately, both of these methods are labor-intensive and error-prone.Failure-atomicity libraries and programming language extensions can simplify this task. However, all the proposed solutions either require pervasive changes to existing software or incur unacceptable overheads to runtime performance. As a result, porting legacy applications to leverage NVM is likely to be prohibitively difficult and time-consuming.This dissertation first presents Breeze, an NVM toolchain that minimizes the changes necessary to enable legacy code to reap the benefits of directly accessing NVM. In contrast to PMDK and NVM-Direct, Breeze reduces the programming effort of porting Memcached and MongoDB by up to 2.8×, while providing equal or superior performance.Second, it introduces NVHooks, a compiler that automatically annotates NVM accesses and avoids disruptive and error-prone changes to programs. NVHooks reduces the cost of these annotations by applying novel, NVM-specific optimizations to their placement. For our tested benchmarks, NVHooks matches the performance of hand-annotated code while minimizing programmer effort.Finally, it presents Pronto, a new NVM library that reduces the programming effort required to add persistence to volatile data structures. Pronto uses asynchronous semantic logging (ASL) to allow adding persistence to the existing volatile data structure (e.g., C++ Standard Template Library containers) with minor programming effort. ASL moves most durability code off the critical path. Our evaluation shows Pronto data structures outperform highly-optimized NVM data structures by a large margin
Towards lightweight and high-performance hardware transactional memory
Conventional lock-based synchronization serializes accesses to critical sections guarded by the same lock. Using multiple locks brings the possibility of a deadlock or a livelock in the program, making parallel programming a difficult task. Transactional Memory (TM) is a promising paradigm for parallel programming, offering an alternative to lock-based synchronization. TM eliminates the risk of deadlocks and livelocks, while it provides the desirable semantics of Atomicity, Consistency, and Isolation of critical sections. TM speculatively executes a series of memory accesses as a single, atomic, transaction. The speculative changes of a transaction are kept private until the transaction commits. If a transaction can break the atomicity or cause a deadlock or livelock, the TM system aborts the transaction and rolls back the speculative changes.
To be effective, a TM implementation should provide high performance and scalability. While implementations of TM in pure software (STM) do not provide desirable performance, Hardware TM (HTM) implementations introduce much smaller overhead and have relatively good scalability, due to their better control of hardware resources. However, many HTM systems support only the transactions that fit limited hardware resources (for example, private caches), and fall back to software mechanisms if hardware limits are reached. These HTM systems, called best-effort HTMs, are not desirable since they force a programmer to think in terms of hardware limits, to use both HTM and STM, and to manage concurrent transactions in HTM and STM. In contrast with best-effort HTMs, unbounded HTM systems support overflowed transactions, that do not fit into private caches. Unbounded HTM systems often require complex protocols or expensive hardware mechanisms for conflict detection between overflowed transactions. In addition, an execution with overflowed transactions is often much slower than an execution that has only regular transactions. This is typically due to restrictive or approximative conflict management mechanism used for overflowed transactions.
In this thesis, we study hardware implementations of transactional memory, and make three main contributions. First, we improve the general performance of HTM systems by proposing a scalable protocol for conflict management. The protocol has precise conflict detection, in contrast with often-employed inexact Bloom-filter-based conflict detection, which often falsely report conflicts between transactions. Second, we propose a best-effort HTM that utilizes the new scalable conflict detection protocol, termed EazyHTM. EazyHTM allows parallel commits for all non-conflicting transactions, and generally simplifies transaction commits.
Finally, we propose an unbounded HTM that extends and improves the initial protocol for conflict management, and we name it EcoTM. EcoTM features precise conflict detection, and it efficiently supports large as well as small and short transactions. The key idea of EcoTM is to leverage an observation that very few locations are actually conflicting, even if applications have high contention. In EcoTM, each core locally detects if a cache line is non-conflicting, and conflict detection mechanism is invoked only for the few potentially conflicting cache lines.La Sincronización tradicional basada en los cerrojos de exclusión mutua (locks) serializa los accesos a las secciones crÃticas protegidas este cerrojo. La utilización de varios cerrojos en forma concurrente y/o paralela aumenta la posibilidad de entrar en abrazo mortal (deadlock) o en un bloqueo activo (livelock) en el programa, está es una de las razones por lo cual programar en forma paralela resulta ser mucho mas dificultoso que programar en forma secuencial. La memoria transaccional (TM) es un paradigma prometedor para la programación paralela, que ofrece una alternativa a los cerrojos. La memoria transaccional tiene muchas ventajas desde el punto de vista tanto práctico como teórico. TM elimina el riesgo de bloqueo mutuo y de bloqueo activo, mientras que proporciona una semántica de atomicidad, coherencia, aislamiento con caracterÃsticas similares a las secciones crÃticas. TM ejecuta especulativamente una serie de accesos a la memoria como una transacción atómica.
Los cambios especulativos de la transacción se mantienen privados hasta que se confirma la transacción. Si una transacción entra en conflicto con otra transacción o sea que alguna de ellas escribe en una dirección que la otra leyó o escribió, o se entra en un abrazo mortal o en un bloqueo activo, el sistema de TM aborta la transacción y revierte los cambios especulativos. Para ser eficaz, una implementación de TM debe proporcionar un alto rendimiento y escalabilidad. Las implementaciones de TM en el software (STM) no proporcionan este desempeño deseable, en cambio, las mplementaciones de TM en hardware (HTM) tienen mejor desempeño y una escalabilidad relativamente buena, debido a su mejor control de los recursos de hardware y que la resolución de los conflictos asà el mantenimiento y gestión de los datos se hace en hardware. Sin embargo, muchos de los sistemas de HTM están limitados a los recursos de hardware disponibles, por ejemplo el tamaño de las caches privadas, y dependen de mecanismos de software para cuando esos lÃmites son sobrepasados. Estos sistemas HTM, llamados best-effort HTM no son deseables, ya que obligan al programador a pensar en términos de los lÃmites existentes en el hardware que se esta utilizando, asà como en el sistema de STM que se llama cuando los recursos son sobrepasados. Además, tiene que resolver que transacciones hardware y software se ejecuten concurrentemente. En cambio, los sistemas
de HTM ilimitados soportan un numero de operaciones ilimitadas o sea no están restringidos a lÃmites impuestos artificialmente por el hardware, como ser el tamaño de las caches o buffers internos. Los sistemas HTM ilimitados por lo general requieren protocolos complejos o mecanismos muy costosos para la detección de conflictos y el mantenimiento de versiones de los datos entre las transacciones. Por otra parte, la ejecución de transacciones es a menudo mucho más lenta que en una ejecución sobre un sistema de HTM que este limitado. Esto es debido al que los mecanismos utilizados en el HTM
limitado trabaja con conjuntos de datos relativamente pequeños que caben o están muy cerca del núcleo del procesador.
En esta tesis estudiamos implementaciones de TM en hardware. Presentaremos tres contribuciones principales: Primero, mejoramos el rendimiento general de los sistemas, al proponer un protocolo escalable para la gestión de conflictos. El protocolo detecta los conflictos de forma precisa, en contraste con otras técnicas basadas en filtros Bloom, que pueden reportar conflictos falsos entre las transacciones. Segundo, proponemos un best-effort HTM que utiliza el nuevo protocolo escalable detección de conflictos, denominado EazyHTM. EazyHTM permite la ejecución completamente paralela de todas las transacciones sin conflictos, y por lo general simplifica la ejecución. Por último, proponemos una extensión y mejora del protocolo inicial para la gestión de conflictos, que llamaremos EcoTM. EcoTM cuenta con detección de conflictos precisa, eficiente y es compatible tanto con transacciones grandes como con pequeñas. La idea clave de EcoTM es aprovechar la observación que en muy pocas ubicaciones de memoria aparecen los conflictos entre las transacciones, incluso en aplicaciones tienen muchos conflictos. En EcoTM, cada núcleo detecta localmente si la lÃnea es conflictiva, además existe un mecanismo de detección de conflictos detallado que solo se activa para las pocas lÃneas de memoria que son potencialmente conflictivas
New hardware support transactional memory and parallel debugging in multicore processors
This thesis contributes to the area of hardware support for parallel programming by introducing new hardware elements in multicore processors, with the aim of improving the performance and optimize new tools, abstractions and applications related with parallel programming, such as transactional memory and data race detectors. Specifically, we configure a hardware transactional memory system with signatures as part of the hardware support, and we develop a new hardware filter for reducing the signature size. We also develop the first hardware asymmetric data race detector (which is also able to tolerate them), based also in hardware signatures. Finally, we propose a new module of hardware signatures that solves some of the problems that we found in the previous tools related with the lack of flexibility in hardware signatures
Drinking from Both Glasses: Combining Pessimistic and Optimistic Tracking of Cross-Thread Dependences *
Abstract It is notoriously challenging to develop parallel software systems that are both scalable and correct. Runtime support for parallelism-such as multithreaded record & replay, data race detectors, transactional memory, and enforcement of stronger memory models-helps achieve these goals, but existing commodity solutions slow programs substantially in order to track (i.e., detect or control) an execution's cross-thread dependences accurately. Prior work tracks cross-thread dependences either "pessimistically," slowing every program access, or "optimistically," allowing for lightweight instrumentation of most accesses but dramatically slowing accesses involved in cross-thread dependences. This paper seeks to hybridize pessimistic and optimistic tracking, which is challenging because there exists a fundamental mismatch between pessimistic and optimistic tracking. We address this challenge based on insights about how dependence tracking and program synchronization interact, and introduce a novel approach called hybrid tracking. Hybrid tracking is suitable for building efficient runtime support, which we demonstrate by building hybridtracking-based versions of a dependence recorder and a region serializability enforcer. An adaptive, profile-based policy makes runtime decisions about switching between pessimistic and optimistic tracking. Our evaluation shows that hybrid tracking enables runtime support to overcome the performance limitations of both pessimistic and optimistic tracking alone
- …