19 research outputs found

    DrPin: a dynamic binary instrumenter for multiple processor architectures

    Advisor: Rodolfo Jardim de Azevedo. Master's dissertation - Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Programs' complexity is rising, and the tools used in their development have changed to keep up with this evolution. Modern applications rely heavily on dynamically loaded shared libraries, and some even generate code at runtime; therefore, the static analysis tools used to debug and understand applications are no longer sufficient to capture the full picture of an application. As a consequence, dynamic analysis tools (those that run during execution) are being adopted and integrated into the development and study of modern applications. Among those, tools that operate directly on the program binary are particularly useful amid the sea of dynamically loaded libraries, where the source code might not be readily available. Building tools that manipulate and instrument binary code at runtime is particularly difficult and error-prone: a minor bug can completely disrupt the behavior of the binary being analyzed. Because of that, Dynamic Binary Instrumentation (DBI) frameworks have become increasingly popular. These frameworks provide the means to build dynamic binary analysis tools with little effort. Among them, Pin 2 has been by far the most popular and easiest to use. However, since the release of the Linux kernel 4 series, it has been unsupported. In this work we study the challenges faced when building a new DBI framework (DrPin) that seeks to be fully compatible with the Pin 2 API, without the restrictions of Pin 3, while also supporting multiple architectures (x86-64, x86, Arm, Aarch64) and modern Linux systems. DrPin currently supports 83 Pin 2 API functions, which makes it capable of running many pintools originally written for Pin 2 without any modification. Comparing the performance of DrPin against the original Pin 2 for a simple tool that counts the number of executed instructions, we observed that for the SPECint 2006 benchmark we were, on average, only 10% slower than Pin 2 and 11.6 times slower than native execution. We also explored the ecosystem around DBI frameworks. Specifically, we studied and extended a technique that uses dynamic binary analysis tools, built with the help of DBI frameworks, to predict the performance of a given architecture when executing a particular program or benchmark without running the entire program or benchmark. In particular, we extended the SimPoint methodology to obtain further gains in the time required to obtain such predictions. We showed that by taking into account similarities in program behavior across different inputs, we can further reduce the time it takes to obtain simulation results for entire benchmarks. Specifically for SPECint 2006, we showed that the number of SimPoints (which is directly proportional to the simulation time) can be reduced by an average of 32% while losing only 0.06% of accuracy compared with the original technique. Accepting a further 0.5% loss in accuracy, the simulation time is reduced by an average of 66%.
    Master's degree in Computer Science. Funding: CNPq grant 143197/2018-5; CAPES, Finance Code 001.
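    For concreteness, the measurement above uses an instruction-counting pintool of the kind shipped with the Pin kit. The sketch below is a minimal version of such a tool written against the public Pin 2 API (INS_AddInstrumentFunction / INS_InsertCall); it is a generic illustration of the API surface DrPin re-implements, not code taken from the dissertation, and it requires the Pin SDK headers to build.

```cpp
// Minimal Pin 2-style instruction-counting tool (the classic "inscount" example).
// Illustrative sketch only; it uses the standard Pin 2 API that DrPin re-implements.
#include "pin.H"
#include <iostream>

static UINT64 icount = 0;  // number of dynamically executed instructions

// Analysis routine: called before every executed instruction.
VOID CountInstruction() { icount++; }

// Instrumentation routine: called once per static instruction when it is first decoded.
VOID Instruction(INS ins, VOID *v)
{
    // Insert a call to CountInstruction before every instruction.
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountInstruction, IARG_END);
}

// Called when the instrumented program exits.
VOID Fini(INT32 code, VOID *v)
{
    std::cerr << "Executed instructions: " << icount << std::endl;
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;          // parse Pin/tool command line
    INS_AddInstrumentFunction(Instruction, 0);   // register per-instruction instrumentation
    PIN_AddFiniFunction(Fini, 0);                // register exit callback
    PIN_StartProgram();                          // hand control to the instrumented program; never returns
    return 0;
}
```

    Under Pin 2 such a tool is built as a shared object and launched as `pin -t inscount.so -- ./program`; a Pin 2-compatible framework such as DrPin is expected to load the same tool unmodified.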

    Reuse Detector: improving the management of STT-RAM SLLCs

    Various constraints of Static Random Access Memory (SRAM) are leading to the consideration of new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, like slow and energy-hungry write operations, that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings by leveraging a new management mechanism for STT-RAM SLLCs. This approach is based on the previous observation that although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e. blocks referenced several times show a high probability of forthcoming reuse. As such, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, behave inefficiently. In this paper, we employ a cache management mechanism that selects the contents of the SLLC so as to exploit reuse locality instead of temporal locality. Specifically, our proposal consists of the inclusion of a Reuse Detector (RD) between the private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, in order to avoid their insertion into the SLLC, hence reducing the number of write operations and the energy consumption in the STT-RAM. Our evaluation, using multiprogrammed workloads on quad-core, eight-core and 16-core systems, reveals that our scheme delivers, on average, energy reductions in the SLLC in the range of 30–37%, additional energy savings in the main memory in the range of 6–8%, and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management scheme, reporting, depending on the specific scenario and the kind of applications used, SLLC energy savings 4–11% higher than those of DASCA, performance improvements 1.5–14% higher, and additional improvements in DRAM energy consumption 2–9% higher than DASCA. Peer reviewed. Postprint (author's final draft).
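    The admission policy described above can be captured in a few lines: a block arriving from the private levels is written into the STT-RAM SLLC only if it has already shown reuse; otherwise it bypasses the SLLC and the write (and its energy) is saved. The following sketch is a simplified software model of that filter; the direct-mapped tag table is an assumption made for illustration, not the Reuse Detector organization evaluated in the paper.

```cpp
// Simplified software model of the Reuse Detector idea: admit a block into the
// STT-RAM SLLC only after it has been seen before (i.e., it shows reuse).
#include <cstddef>
#include <cstdint>
#include <vector>

class ReuseDetector {
public:
    explicit ReuseDetector(std::size_t entries) : tags_(entries, kInvalid) {}

    // Called when a block leaves the private cache levels on its way to the SLLC.
    // Returns true if the block should be inserted into the SLLC (reuse detected),
    // false if it should bypass it (first sighting, no evidence of reuse yet).
    bool shouldInsert(std::uint64_t blockAddr) {
        std::size_t idx = blockAddr % tags_.size();
        if (tags_[idx] == blockAddr) return true;  // seen before: likely to be reused
        tags_[idx] = blockAddr;                    // remember it; admit only on the next sighting
        return false;
    }

private:
    static constexpr std::uint64_t kInvalid = ~0ULL;
    std::vector<std::uint64_t> tags_;
};
```

    Every bypassed block avoids an STT-RAM write entirely, which is consistent with where the abstract's SLLC energy savings come from.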

    Mitosis based speculative multithreaded architectures

    In the last decade, industry made a sharp turn towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have released multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP), while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-threaded sections of code (single-threaded applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl's law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take advantage of multi-core platforms: Speculative Multithreading (SpMT) and non-speculative clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences while avoiding a large degree of speculation. Despite the large amount of research on both approaches, the techniques proposed so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed up sequential or lightly threaded applications on multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose an SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique fine-grain thread decomposition algorithm that adapts to the available parallelism in the application.
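    The key Mitosis mechanism is the pre-computation slice (p-slice): a short extract of the program that a speculative thread runs first to predict its live-in values, with the prediction validated before the thread commits. The toy sketch below mimics that spawn/predict/validate flow in plain software; the function names and the single integer live-in are illustrative assumptions, and the real mechanism relies on hardware support for versioned state and squash/recovery.

```cpp
// Toy software analogue of a Mitosis-style speculative spawn: a speculative
// thread starts from a live-in predicted by a p-slice, and the prediction is
// validated before the speculative result is reused. Purely illustrative.
#include <future>
#include <iostream>

// Hypothetical p-slice: a cheap subset of the original code that only produces
// the value the speculative thread needs as a live-in.
int pSlicePredictLiveIn(int x) { return x * 2; }

// Work of the speculative thread, parameterized by its live-in.
long speculativeBody(int liveIn) {
    long sum = 0;
    for (int i = 0; i < liveIn; ++i) sum += i;
    return sum;
}

int main() {
    int x = 1000;

    // Spawn: run the speculative thread on the p-slice's predicted live-in.
    int predicted = pSlicePredictLiveIn(x);
    std::future<long> spec = std::async(std::launch::async, speculativeBody, predicted);

    // Meanwhile the non-speculative thread computes the real live-in
    // (stand-in for the full computation the slice skipped).
    int actual = x * 2;

    // Validation at commit time: reuse the speculative result only if the
    // prediction was correct; otherwise "squash" and re-execute sequentially.
    long result = (predicted == actual) ? spec.get() : speculativeBody(actual);
    std::cout << result << "\n";
    return 0;
}
```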

    A write avoidance technique for non-volatile memory-based last-level caches

    Ph.D. dissertation - Seoul National University, Graduate School, Department of Computer Science and Engineering, February 2016. Advisor: 신현식.
    Non-volatile memory (NVM) is considered a promising memory technology for last-level caches (LLCs) due to its low leakage power and high storage density. However, NVM has some drawbacks, including high dynamic energy when modifying NVM cells, long write latency, and limited write endurance. To overcome these problems, the thesis focuses on two approaches: cache coherence and an NVM capacity management policy for hybrid cache architectures (HCA). First, we review existing cache coherence protocols in the context of NVM-based LLCs. Our analysis reveals that the LLCs perform unnecessary write operations because legacy protocols pay very little attention to reducing the number of write accesses to the LLC. Therefore, a write avoidance cache coherence protocol (WACC) is proposed to reduce the number of write operations to the LLC. In addition, novel HCA schemes are proposed to efficiently utilize SRAM. Previous studies on HCA have concentrated on detecting write-intensive blocks and placing them into the SRAM ways. Unlike those studies, the proposed dynamic way adjusting algorithm (DWA) and linefill-aware cache partitioning (LCP) calculate the optimal number of NVM ways and SRAM ways so as to minimize the NVM write count, and assign the corresponding numbers of NVM and SRAM ways to cores. The simulation results show that WACC achieves a 13.2% reduction in dynamic energy consumption. For the HCA schemes, the dynamic energy consumption of DWA and LCP is reduced by 26.9% and 37.2%, respectively.
    Contents:
    I. Introduction (purpose of the thesis; background; motivation; contributions; organization of the thesis)
    II. Related work (hybrid cache architecture: write intensity prediction studies, static approaches, hybrid cache architecture for main memory; cache partitioning schemes)
    III. Write avoidance cache coherence protocol (limitation of existing cache coherence protocols; write avoidance cache coherence protocol)
    IV. NVM capacity management policy for hybrid cache architecture (concept and feasibility of the policy; dynamic way adjusting: maximum stack distance, adjusting the number of NVM ways, algorithm; cache partitioning for hybrid cache architecture: linefill-aware cache partitioning, metrics, algorithm; overhead of the policy)
    V. Experimental results (experimental environment; write accesses to NVM; dynamic energy consumption; lifetime; multi-core environment)
    VI. Conclusion (conclusion; future work); References; Abstract in Korean
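    The "maximum stack distance" that drives the dynamic way adjusting (DWA) policy is a standard reuse-distance measurement. A minimal software profiler for one cache set is sketched below as an illustration; how the measured distance is translated into a split between SRAM and NVM ways is the thesis's own policy and is not reproduced here.

```cpp
// Per-set LRU stack distance profiler: the distance of an access is the
// position of its block in the LRU stack at the time it is touched again.
// A set whose maximum distance is d can, in this simplified view, keep all
// its reused blocks resident with d+1 ways.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

class StackDistanceProfiler {
public:
    // Returns the LRU stack distance of this access (0 = most recently used),
    // or -1 for a cold (first-time) access, and updates the stack.
    int access(std::uint64_t blockAddr) {
        auto it = std::find(stack_.begin(), stack_.end(), blockAddr);
        int distance = -1;
        if (it != stack_.end()) {
            distance = static_cast<int>(it - stack_.begin());
            maxDistance_ = std::max(maxDistance_, distance);
            stack_.erase(it);
        }
        stack_.push_front(blockAddr);  // move (or insert) the block to the MRU position
        return distance;
    }

    int maxDistance() const { return maxDistance_; }

private:
    std::deque<std::uint64_t> stack_;
    int maxDistance_ = -1;
};
```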