2,217 research outputs found

    A Safety-First Approach to Memory Models.

    Full text link
    Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics. Current concurrent languages support a relaxed memory model that requires programmers to explicitly annotate all memory accesses that can participate in a data-race ("unsafe" accesses). This requirement allows compiler and hardware to aggressively optimize unannotated accesses, which are assumed to be data-race-free ("safe" accesses), while still preserving SC semantics. However, unannotated data races are easy for programmers to accidentally introduce and are difficult to detect, and in such cases the safety and correctness of programs are significantly compromised. This dissertation argues instead for a safety-first approach, whereby every memory operation is treated as potentially unsafe by the compiler and hardware unless it is proven otherwise. The first solution, DRFx memory model, allows many common compiler and hardware optimizations (potentially SC-violating) on unsafe accesses and uses a runtime support to detect potential SC violations arising from reordering of unsafe accesses. On detecting a potential SC violation, execution is halted before the safety property is compromised. The second solution takes a different approach and preserves SC in both compiler and hardware. Both SC-preserving compiler and hardware are also built on the safety-first approach. All memory accesses are treated as potentially unsafe by the compiler and hardware. SC-preserving hardware relies on different static and dynamic techniques to identify safe accesses. Our results indicate that supporting SC at the language level is not expensive in terms of performance and hardware complexity. The dissertation also explores an extension of this safety-first approach for data-parallel accelerators such as Graphics Processing Units (GPUs). Significant microarchitectural differences between CPU and GPU require rethinking of efficient solutions for preserving SC in GPUs. The proposed solution based on our SC-preserving approach performs nearly on par with the baseline GPU that implements a data-race-free-0 memory model.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120794/1/ansingh_1.pd

    The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems

    Get PDF
    Abstract When supported in silicon, transactional memory (TM

    Software caching techniques and hardware optimizations for on-chip local memories

    Get PDF
    Despite the fact that the most viable L1 memories in processors are caches, on-chip local memories have been a great topic of consideration lately. Local memories are an interesting design option due to their many benefits: less area occupancy, reduced energy consumption and fast and constant access time. These benefits are especially interesting for the design of modern multicore processors since power and latency are important assets in computer architecture today. Also, local memories do not generate coherency traffic which is important for the scalability of the multicore systems. Unfortunately, local memories have not been well accepted in modern processors yet, mainly due to their poor programmability. Systems with on-chip local memories do not have hardware support for transparent data transfers between local and global memories, and thus ease of programming is one of the main impediments for the broad acceptance of those systems. This thesis addresses software and hardware optimizations regarding the programmability, and the usage of the on-chip local memories in the context of both single-core and multicore systems. Software optimizations are related to the software caching techniques. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this thesis, we start optimizing traditional software cache by proposing a hierarchical, hybrid software-cache architecture. Afterwards, we develop few optimizations in order to speedup our hybrid software cache as much as possible. As the result of the software optimizations we obtain that our hybrid software cache performs from 4 to 10 times faster than traditional software cache on a set of NAS parallel benchmarks. We do not stop with software caching. We cover some other aspects of the architectures with on-chip local memories, such as the quality of the generated code and its correspondence with the quality of the buffer management in local memories, in order to improve performance of these architectures. Therefore, we run our research till we reach the limit in software and start proposing optimizations on the hardware level. Two hardware proposals are presented in this thesis. One is about relaxing alignment constraints imposed in the architectures with on-chip local memories and the other proposal is about accelerating the management of local memories by providing hardware support for the majority of actions performed in our software cache.Malgrat les memòries cau encara son el component basic pel disseny del subsistema de memòria, les memòries locals han esdevingut una alternativa degut a les seves característiques pel que fa a l’ocupació d’àrea, el seu consum energètic i el seu rendiment amb un temps d’accés ràpid i constant. Aquestes característiques son d’especial interès quan les properes arquitectures multi-nucli estan limitades pel consum de potencia i la latència del subsistema de memòria.Les memòries locals pateixen de limitacions respecte la complexitat en la seva programació, fet que dificulta la seva introducció en arquitectures multi-nucli, tot i els avantatges esmentats anteriorment. Aquesta tesi presenta un seguit de solucions basades en programari i maquinari específicament dissenyat per resoldre aquestes limitacions.Les optimitzacions del programari estan basades amb tècniques d'emmagatzematge de memòria cau suportades per llibreries especifiques. La memòria cau per programari és un sòlid mètode per proporcionar a l'usuari una visió transparent de l'arquitectura, però aquest enfocament pot patir d'un rendiment deficient. En aquesta tesi, es proposa una estructura jeràrquica i híbrida. Posteriorment, desenvolupem optimitzacions per tal d'accelerar l’execució del programari que suporta el disseny de la memòria cau. Com a resultat de les optimitzacions realitzades, obtenim que el nostre disseny híbrid es comporta de 4 a 10 vegades més ràpid que una implementació tradicional de memòria cau sobre un conjunt d’aplicacions de referencia, com son els “NAS parallel benchmarks”.El treball de tesi inclou altres aspectes de les arquitectures amb memòries locals, com ara la qualitat del codi generat i la seva correspondència amb la qualitat de la gestió de memòria intermèdia en les memòries locals, per tal de millorar el rendiment d'aquestes arquitectures. La tesi desenvolupa propostes basades estrictament en el disseny de nou maquinari per tal de millorar el rendiment de les memòries locals quan ja no es possible realitzar mes optimitzacions en el programari. En particular, la tesi presenta dues propostes de maquinari: una relaxa les restriccions imposades per les memòries locals respecte l’alineament de dades, l’altra introdueix maquinari específic per accelerar les operacions mes usuals sobre les memòries locals

    Especulação de threads usando arquiteturas de memória transacional em hardware

    Get PDF
    Orientadores: Guido Costa Souza de Araújo, José Nelson AmaralTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Especulação no nível de threads (TLS) é uma técnica em hardware/software que possibilita a execução paralela de múltiplas iterações de um laço, inclusive na presença de algumas dependências loop-carried. TLS exige mecanismos em hardware para auxiliar a detecção de conflitos, o armazenamento especulativo, os commits das transações em ordem, e o roll-back das transações. Trabalhos anteriores exploraram enfoques para implementar TLS, tanto em hardware dedicado como puramente em software, e tentaram predizer o desempenho de futuras implementações de TLS em hardware. Contudo, não existe nenhum processador comercial que forneça suporte direto para TLS. Entretanto, execução especulativa é suportada na forma de Memória Transacional em Hardware (HTM) ¿ disponível em processadores modernos como Intel Core e IBM POWER8. HTM implementa três características essenciais para TLS: detecção de conflitos, armazenamento especulativo, e roll-back de transações. Antes de aplicar TLS a um laço quente, é necessário determinar se o laço tem potencial para ser especulado. Um laço pode ser adequado para TLS se a probabilidade de dependências loop-carried em tempo de execução for baixa; para estimar esta probabilidade um perfilamento de dependências do laço deve ser usado. Este trabalho apresenta um verificador das dependências loop-carried integrado como uma nova extensão de OpenMP, a diretiva parallel for check, a qual pode ser usada para ajudar desenvolvedores a identificarem a existência destas dependências em construções parallel for. Este trabalho também apresenta uma análise detalhada da aplicação de HTM para a paralelização de laços com TLS e descreve uma avaliação cuidadosa da implementação de TLS usando HTMs disponíveis em processadores modernos. Como resultado, esta tese proporciona evidências para validar várias afirmações importantes sobre o desempenho de TLS nestas arquiteturas. Os resultados experimentais mostram que TLS usando HTM produz speedups de até 3.8× para alguns laços. Finalmente, este trabalho descreve uma nova técnica de especulação para a otimização, e execução simultânea, de múltiplos traços de regiões de código quente. Esta técnica, chamada Speculative Trace Optimization (STO), enumera, otimiza, e executa especulativamente traços de laços quentes. Isto requer o suporte em hardware disponível em sistemas HTM. Este trabalho discute as características necessárias para suportar STO: multi-versão, resolução de conflitos tardia, detecção de conflitos prematura, e sincronização das transações. Uma revisão das arquiteturas HTM existentes ¿ Intel TSX, IBM BG/Q, e IBM POWER8 ¿ mostra que nenhuma delas tem todas as características requeridas para implementar STO. Entretanto, este trabalho mostra que STO pode ser implementado nas arquiteturas HTM existentes através da adição de privatização e código para esperar/retomarAbstract: Thread-Level Speculation (TLS) is a hardware/software technique that enables the execution of multiple loop iterations in parallel, even in the presence of some loop-carried dependences. TLS requires hardware mechanisms to support conflict detection, speculative storage, in-order commit of transactions, and transaction roll-back. Prior research has investigated approaches to implement TLS, either on dedicated hardware or purely in software, and has attempted to predict the performance of future TLS hardware implementations. Nevertheless, there is no off-the-shelf processor that provides direct support for TLS. Speculative execution is supported, however, in the form of Hardware Transactional Memory (HTM) ¿ available in recent processors such as the Intel Core and the IBM POWER8. HTM implements three key features required by TLS: conflict detection, speculative storage, and transaction roll-back. Before applying TLS to a hot loop, it is necessary to determine if the loop has potential to be amenable. A loop could be amenable if the probability of loop-carried dependences at runtime is low; to measure this probability loop dependence profiling is used. This project presents a novel dynamic loop-carried dependence checker integrated as a new extension to OpenMP, the parallel for check construct, which can be used to help programmers identify the existence of loop-carried dependences in parallel for constructs. This work also presents a detailed analysis of the application of HTM support for loop parallelization with TLS and describes a careful evaluation of the implementation of TLS on the HTM extensions available in such machines. As a result, it provides evidence to support several important claims about the performance of TLS over HTM in the Intel Core and the IBM POWER8 architectures. Experimental results reveal that by implementing TLS on top of HTM, speed-ups of up to 3.8× can be obtained for some loops. Finally, this work describes a novel speculation technique for the optimization, and simultaneous execution, of multiple alternative traces of hot code regions. This technique, called Speculative Trace Optimization (STO), enumerates, optimizes, and speculatively executes traces of hot loops. It requires hardware support that can be provided in a similar fashion as that available in HTM systems. This work discusses the necessary features to support STO, namely multi-versioning, lazy conflict resolution, eager conflict detection, and transaction synchronization. A review of existing HTM architectures ¿ Intel TSX, IBM BG/Q, and IBM POWER8 ¿ shows that none of them has all the features required to implement STO. However, this work demonstrates that STO can be implemented on top of existing HTM architectures through the addition of privatization and wait/resume codeDoutoradoCiência da ComputaçãoDoutor em Ciência da ComputaçãoCAPESFAPESPCNP

    Performance Optimization Strategies for Transactional Memory Applications

    Get PDF
    This thesis presents tools for Transactional Memory (TM) applications that cover multiple TM systems (Software, Hardware, and hybrid TM) and use information of all different layers of the TM software stack. Therefore, this thesis addresses a number of challenges to extract static information, information about the run time behavior, and expert-level knowledge to develop these new methods and strategies for the optimization of TM applications

    Towards lightweight and high-performance hardware transactional memory

    Get PDF
    Conventional lock-based synchronization serializes accesses to critical sections guarded by the same lock. Using multiple locks brings the possibility of a deadlock or a livelock in the program, making parallel programming a difficult task. Transactional Memory (TM) is a promising paradigm for parallel programming, offering an alternative to lock-based synchronization. TM eliminates the risk of deadlocks and livelocks, while it provides the desirable semantics of Atomicity, Consistency, and Isolation of critical sections. TM speculatively executes a series of memory accesses as a single, atomic, transaction. The speculative changes of a transaction are kept private until the transaction commits. If a transaction can break the atomicity or cause a deadlock or livelock, the TM system aborts the transaction and rolls back the speculative changes. To be effective, a TM implementation should provide high performance and scalability. While implementations of TM in pure software (STM) do not provide desirable performance, Hardware TM (HTM) implementations introduce much smaller overhead and have relatively good scalability, due to their better control of hardware resources. However, many HTM systems support only the transactions that fit limited hardware resources (for example, private caches), and fall back to software mechanisms if hardware limits are reached. These HTM systems, called best-effort HTMs, are not desirable since they force a programmer to think in terms of hardware limits, to use both HTM and STM, and to manage concurrent transactions in HTM and STM. In contrast with best-effort HTMs, unbounded HTM systems support overflowed transactions, that do not fit into private caches. Unbounded HTM systems often require complex protocols or expensive hardware mechanisms for conflict detection between overflowed transactions. In addition, an execution with overflowed transactions is often much slower than an execution that has only regular transactions. This is typically due to restrictive or approximative conflict management mechanism used for overflowed transactions. In this thesis, we study hardware implementations of transactional memory, and make three main contributions. First, we improve the general performance of HTM systems by proposing a scalable protocol for conflict management. The protocol has precise conflict detection, in contrast with often-employed inexact Bloom-filter-based conflict detection, which often falsely report conflicts between transactions. Second, we propose a best-effort HTM that utilizes the new scalable conflict detection protocol, termed EazyHTM. EazyHTM allows parallel commits for all non-conflicting transactions, and generally simplifies transaction commits. Finally, we propose an unbounded HTM that extends and improves the initial protocol for conflict management, and we name it EcoTM. EcoTM features precise conflict detection, and it efficiently supports large as well as small and short transactions. The key idea of EcoTM is to leverage an observation that very few locations are actually conflicting, even if applications have high contention. In EcoTM, each core locally detects if a cache line is non-conflicting, and conflict detection mechanism is invoked only for the few potentially conflicting cache lines.La Sincronización tradicional basada en los cerrojos de exclusión mutua (locks) serializa los accesos a las secciones críticas protegidas este cerrojo. La utilización de varios cerrojos en forma concurrente y/o paralela aumenta la posibilidad de entrar en abrazo mortal (deadlock) o en un bloqueo activo (livelock) en el programa, está es una de las razones por lo cual programar en forma paralela resulta ser mucho mas dificultoso que programar en forma secuencial. La memoria transaccional (TM) es un paradigma prometedor para la programación paralela, que ofrece una alternativa a los cerrojos. La memoria transaccional tiene muchas ventajas desde el punto de vista tanto práctico como teórico. TM elimina el riesgo de bloqueo mutuo y de bloqueo activo, mientras que proporciona una semántica de atomicidad, coherencia, aislamiento con características similares a las secciones críticas. TM ejecuta especulativamente una serie de accesos a la memoria como una transacción atómica. Los cambios especulativos de la transacción se mantienen privados hasta que se confirma la transacción. Si una transacción entra en conflicto con otra transacción o sea que alguna de ellas escribe en una dirección que la otra leyó o escribió, o se entra en un abrazo mortal o en un bloqueo activo, el sistema de TM aborta la transacción y revierte los cambios especulativos. Para ser eficaz, una implementación de TM debe proporcionar un alto rendimiento y escalabilidad. Las implementaciones de TM en el software (STM) no proporcionan este desempeño deseable, en cambio, las mplementaciones de TM en hardware (HTM) tienen mejor desempeño y una escalabilidad relativamente buena, debido a su mejor control de los recursos de hardware y que la resolución de los conflictos así el mantenimiento y gestión de los datos se hace en hardware. Sin embargo, muchos de los sistemas de HTM están limitados a los recursos de hardware disponibles, por ejemplo el tamaño de las caches privadas, y dependen de mecanismos de software para cuando esos límites son sobrepasados. Estos sistemas HTM, llamados best-effort HTM no son deseables, ya que obligan al programador a pensar en términos de los límites existentes en el hardware que se esta utilizando, así como en el sistema de STM que se llama cuando los recursos son sobrepasados. Además, tiene que resolver que transacciones hardware y software se ejecuten concurrentemente. En cambio, los sistemas de HTM ilimitados soportan un numero de operaciones ilimitadas o sea no están restringidos a límites impuestos artificialmente por el hardware, como ser el tamaño de las caches o buffers internos. Los sistemas HTM ilimitados por lo general requieren protocolos complejos o mecanismos muy costosos para la detección de conflictos y el mantenimiento de versiones de los datos entre las transacciones. Por otra parte, la ejecución de transacciones es a menudo mucho más lenta que en una ejecución sobre un sistema de HTM que este limitado. Esto es debido al que los mecanismos utilizados en el HTM limitado trabaja con conjuntos de datos relativamente pequeños que caben o están muy cerca del núcleo del procesador. En esta tesis estudiamos implementaciones de TM en hardware. Presentaremos tres contribuciones principales: Primero, mejoramos el rendimiento general de los sistemas, al proponer un protocolo escalable para la gestión de conflictos. El protocolo detecta los conflictos de forma precisa, en contraste con otras técnicas basadas en filtros Bloom, que pueden reportar conflictos falsos entre las transacciones. Segundo, proponemos un best-effort HTM que utiliza el nuevo protocolo escalable detección de conflictos, denominado EazyHTM. EazyHTM permite la ejecución completamente paralela de todas las transacciones sin conflictos, y por lo general simplifica la ejecución. Por último, proponemos una extensión y mejora del protocolo inicial para la gestión de conflictos, que llamaremos EcoTM. EcoTM cuenta con detección de conflictos precisa, eficiente y es compatible tanto con transacciones grandes como con pequeñas. La idea clave de EcoTM es aprovechar la observación que en muy pocas ubicaciones de memoria aparecen los conflictos entre las transacciones, incluso en aplicaciones tienen muchos conflictos. En EcoTM, cada núcleo detecta localmente si la línea es conflictiva, además existe un mecanismo de detección de conflictos detallado que solo se activa para las pocas líneas de memoria que son potencialmente conflictivas

    Everything You Want to Know About Pointer-Based Checking

    Get PDF
    Lack of memory safety in C/C++ has resulted in numerous security vulnerabilities and serious bugs in large software systems. This paper highlights the challenges in enforcing memory safety for C/C++ programs and progress made as part of the SoftBoundCETS project. We have been exploring memory safety enforcement at various levels - in hardware, in the compiler, and as a hardware-compiler hybrid - in this project. Our research has identified that maintaining metadata with pointers in a disjoint metadata space and performing bounds and use-after-free checking can provide comprehensive memory safety. We describe the rationale behind the design decisions and its ramifications on various dimensions, our experience with the various variants that we explored in this project, and the lessons learned in the process. We also describe and analyze the forthcoming Intel Memory Protection Extensions (MPX) that provides hardware acceleration for disjoint metadata and pointer checking in mainstream hardware, which is expected to be available later this year

    A Survey on Thread-Level Speculation Techniques

    Get PDF
    Producción CientíficaThread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior, compile-time-dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field.MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)
    corecore