    Reuse Detector: Improving the management of STT-RAM SLLCs

Various constraints of Static Random Access Memory (SRAM) are prompting the consideration of new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, such as slow and energy-hungry write operations, that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings with a new management mechanism for STT-RAM SLLCs. The approach builds on the previous observation that although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e., blocks referenced several times have a high probability of forthcoming reuse. As such, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, behave inefficiently. In this paper, we employ a cache management mechanism that selects the contents of the SLLC so as to exploit reuse locality instead of temporal locality. Specifically, our proposal consists of placing a Reuse Detector (RD) between the private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, so that their insertion in the SLLC can be avoided, thereby reducing the number of write operations and the energy consumption of the STT-RAM. Our evaluation, using multiprogrammed workloads on quad-core, eight-core and 16-core systems, reveals that our scheme delivers, on average, energy reductions in the SLLC in the range of 30–37%, additional energy savings in main memory in the range of 6–8%, and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management scheme: depending on the specific scenario and the kind of applications used, it reports SLLC energy savings 4–11% higher than those of DASCA, performance 1.5–14% higher, and improvements in DRAM energy consumption 2–9% higher than DASCA.
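
    The bypass decision at the heart of the RD lends itself to a compact sketch. The C++ fragment below is a minimal illustration under our own assumptions (a set-associative tag store with LRU replacement and an insert-on-second-touch policy), not the paper's exact design:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical Reuse Detector (RD) sitting between the private cache levels
// and the STT-RAM SLLC. A block evicted from a private level is only admitted
// into the SLLC if the RD has seen it before; a first-touch block is merely
// recorded and bypasses the SLLC, saving an STT-RAM write.
class ReuseDetector {
public:
    ReuseDetector(std::size_t sets, std::size_t ways)
        : ways_(ways), tags_(sets) {}

    // Returns true if the block exhibits reuse (insert it into the SLLC),
    // false to bypass the SLLC.
    bool shouldInsertInSLLC(std::uint64_t blockAddr) {
        auto& set = tags_[blockAddr % tags_.size()];
        for (std::size_t i = 0; i < set.size(); ++i) {
            if (set[i] == blockAddr) {          // hit: reuse detected
                set.erase(set.begin() + i);     // promote to MRU position
                set.push_back(blockAddr);
                return true;
            }
        }
        if (set.size() == ways_)                // set full: evict LRU tag
            set.erase(set.begin());
        set.push_back(blockAddr);               // remember this first touch
        return false;
    }

private:
    std::size_t ways_;
    std::vector<std::vector<std::uint64_t>> tags_; // per-set tags in LRU order
};
```

    In a structure like this, only tags are stored, not data, so the detector's own storage and write traffic stay small compared with the STT-RAM writes it helps avoid.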

    DrPin: a dynamic binary instrumenter for multiple processor architectures

    Advisor: Rodolfo Jardim de Azevedo. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação.
    Programs are growing in complexity, and the tools used to develop them have evolved to keep pace. Modern applications rely heavily on dynamically loaded shared libraries, and some even generate code at runtime; static analysis tools used to debug and understand applications are therefore no longer sufficient to get the full picture of an application. As a consequence, dynamic analysis tools (those that run during execution) are being adopted and integrated into the development and study of modern applications. Among these, tools that operate directly on the program binary are particularly useful in the sea of dynamically loaded libraries, where the source code might not be readily available. Building tools that manipulate and instrument binary code at runtime is particularly difficult and error-prone: a minor bug can completely disrupt the behavior of the binary being analyzed. Because of this, Dynamic Binary Instrumentation (DBI) frameworks, which provide the means to build dynamic binary analysis tools with little effort, have become increasingly popular. Among them, Pin 2 has been by far the most popular and easiest to use; however, since the release of the Linux kernel 4 series, it has been unsupported. In this work, we study the challenges faced when building a new DBI framework (DrPin) that aims to be fully compatible with the Pin 2 API, without the restrictions of Pin 3, while running on multiple architectures (x86-64, x86, Arm, Aarch64) and on modern Linux systems. DrPin currently supports 83 Pin 2 API functions, which makes it capable of running many pintools originally written for Pin 2 without any modification. Comparing the performance of DrPin against the original Pin 2 with a simple tool that counts the number of instructions executed, we observed that on the SPECint 2006 benchmark we were, on average, only 10% slower than Pin 2 and 11.6 times slower than native execution. We also explored the ecosystem around DBI frameworks. Specifically, we studied and extended a technique that uses dynamic binary analysis tools, built with the help of DBI frameworks, to predict the performance of a given architecture when executing a particular program or benchmark without running the entire program or benchmark. In particular, we extended the SimPoint methodology to obtain further reductions in the time required to obtain such predictions. We showed that, by taking into account similarities in program behavior across different inputs, we can further reduce the time it takes to obtain simulation results for entire benchmarks. Specifically, for SPECint 2006 we showed that the number of SimPoints (which is directly proportional to simulation time) can be reduced by an average of 32% while losing only 0.06% of accuracy compared to the original technique. Allowing a further 0.5% loss of accuracy, simulation time is reduced by an average of 66%. Funding: CNPq (143197/2018-5); CAPES (Finance Code 001).
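
    The "simple tool that counts the number of instructions executed" used in the comparison above is, in Pin 2 terms, the classic icount pintool. The sketch below follows the style of the Pin 2 examples and uses only basic API calls (PIN_Init, INS_AddInstrumentFunction, INS_InsertCall, PIN_AddFiniFunction, PIN_StartProgram); whether each of these is among the 83 functions DrPin supports is our assumption, but it is exactly the kind of unmodified pintool DrPin aims to run:

```cpp
// icount.cpp -- instruction-counting pintool in the Pin 2 style.
#include <iostream>
#include "pin.H"

static UINT64 icount = 0;

// Analysis routine: runs before every executed instruction.
VOID docount() { icount++; }

// Instrumentation routine: called once per static instruction; inserts
// a call to the analysis routine before it.
VOID Instruction(INS ins, VOID* v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

// Called when the instrumented application exits.
VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Count " << icount << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;     // parse Pin's command line
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                     // never returns
    return 0;
}
```

    With Pin 2 such a tool is built against the kit and launched as `pin -t icount.so -- ./app`; a Pin 2-compatible framework like DrPin would load the same tool the same way.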

    Mitosis based speculative multithreaded architectures

    In the last decade, industry changed course and shifted towards multi-core processor designs, also known as Chip Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have introduced multi-core chips in their product lines, and these have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP), while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-threaded sections of code (single-threaded applications and serial sections of parallel applications) place important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl's law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be straightforward for regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take advantage of multi-core platforms: Speculative Multithreading (SpMT) and non-speculative clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences but avoids a large degree of speculation. Despite the large amount of research on both approaches, the techniques proposed so far have shown only marginal performance improvements. In this thesis we propose novel schemes to speed up sequential or lightly threaded applications on multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose an SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique based on pre-computation slices (p-slices) to manage inter-thread dependences. Thanks to the accuracy and low cost of this technique, Mitosis is able to parallelize applications effectively even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique fine-grain thread decomposition algorithm that adapts to the available parallelism in the application.
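
    To make the p-slice idea concrete, here is a purely conceptual C++ sketch (illustrative names, not the Mitosis ISA or microarchitecture). A p-slice is a pruned copy of just those instructions, between the spawn point and the speculative thread's starting point, that produce the thread's live-in values; because everything else is stripped away, it finishes early and effectively predicts those values, which the hardware later validates, squashing the speculative thread on a mismatch:

```cpp
#include <cstdint>

// Live-in values the speculative thread needs in order to start executing.
struct LiveIns { std::uint64_t x, y; };

// Full work between the spawn point and the speculative thread's start.
// Most of it does not feed the live-ins. (Abridged for illustration.)
LiveIns full_computation(std::uint64_t a, std::uint64_t b) {
    // ... many unrelated instructions elided ...
    std::uint64_t x = a * 3 + 1;
    std::uint64_t y = b ^ x;
    return {x, y};
}

// p-slice: only the instructions that produce x and y survive pruning,
// so it completes long before full_computation and yields *predicted*
// live-ins from which the speculative thread starts running.
LiveIns p_slice(std::uint64_t a, std::uint64_t b) {
    std::uint64_t x = a * 3 + 1;  // kept: feeds live-in x
    std::uint64_t y = b ^ x;      // kept: feeds live-in y
    return {x, y};
}

// Validation performed when the non-speculative thread catches up:
// on a mismatch, the speculative thread's work is squashed.
bool validate(const LiveIns& predicted, const LiveIns& actual) {
    return predicted.x == actual.x && predicted.y == actual.y;
}
```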

    Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing

    Today’s supercomputers are built from state-of-the-art components to extract as much performance as possible and solve the most computationally intensive problems in the world. Building the next generation of exascale supercomputers, however, will require re-architecting many of these components to extract over 50x more performance than the current fastest supercomputer in the United States. To contribute towards this goal, two aspects of the compute-node architecture are examined in this thesis: the on-chip interconnect topology, and the memory and storage checkpointing platforms. As a first step, a skeleton exascale system was modeled to meet 1 exaflop of performance along with 100 petabytes of main memory. The model revealed that large kilo-core processors would be necessary to meet the exaflop performance goal; existing topologies, however, would not scale to those levels. To address this challenge, we investigated and proposed asymmetric high-radix topologies that decouple local and global communication and use routers of different radix to switch network traffic at each level. The proposed topologies scale more readily to higher core counts, with better latency and energy consumption than before. The vast number of components that the model showed would be needed in these exascale systems argues for better fault-tolerance mechanisms. To address this challenge, we showed that local checkpoints within the compute node can be saved to a hybrid DRAM and SSD platform in order to write them faster, without wearing out the SSD or consuming a lot of energy. The hybrid checkpointing platform allows more frequent checkpoints without sacrificing performance. Subsequently, we proposed switching to a DIMM-based SSD in order to perform the fine-grained I/O operations that are integral to interleaving checkpointing with computation while still providing persistence guarantees. Two further techniques that consolidate and overlap checkpointing were designed to better hide the latency of checkpointing to the SSD.
    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/137096/1/sabeyrat_1.pd
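
    The hybrid DRAM+SSD local-checkpointing flow described above can be sketched compactly. The C++ below is a minimal illustration under our own assumptions (one DRAM staging buffer, a background drain to the SSD, newer checkpoints superseding undrained ones), not the thesis's implementation:

```cpp
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <vector>

// Fast path: the compute thread stages its state in DRAM and resumes at
// once. Slow path: a background thread persists the latest staged
// checkpoint to the SSD. Coalescing superseded checkpoints limits SSD wear.
class HybridCheckpointer {
public:
    void checkpoint(const void* state, std::size_t bytes) {
        std::lock_guard<std::mutex> lock(m_);
        const char* p = static_cast<const char*>(state);
        staged_.assign(p, p + bytes);   // DRAM copy: fast, compute resumes
        dirty_ = true;                  // supersedes any undrained checkpoint
    }

    // Invoked periodically from a background thread.
    void drainToSSD(const char* path) {
        std::vector<char> copy;
        {
            std::lock_guard<std::mutex> lock(m_);
            if (!dirty_) return;        // nothing new to persist
            copy = staged_;
            dirty_ = false;
        }
        if (std::FILE* f = std::fopen(path, "wb")) {
            std::fwrite(copy.data(), 1, copy.size(), f);
            std::fclose(f);             // checkpoint now persistent on SSD
        }
    }

private:
    std::mutex m_;
    std::vector<char> staged_;          // DRAM staging buffer
    bool dirty_ = false;
};
```

    Because the drain runs off the critical path, checkpoint frequency is limited by the DRAM copy rather than by SSD write latency, which is the property the hybrid platform exploits.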