63 research outputs found

    BarrierPoint: sampled simulation of multi-threaded applications

    Sampling is a well-known technique to speed up architectural simulation of long-running workloads while maintaining accurate performance predictions. A number of sampling techniques have recently been developed that extend well-known single-threaded techniques to allow sampled simulation of multi-threaded applications. Unfortunately, prior work is limited to non-synchronizing applications (e.g., server throughput workloads); requires functional simulation of the entire application using a detailed cache hierarchy, which limits the overall simulation speedup potential; leads to different units of work across different processor architectures, which complicates performance analysis; or requires massive machine resources to achieve reasonable simulation speedups. In this work, we propose BarrierPoint, a sampling methodology that accelerates simulation by leveraging globally synchronizing barriers in multi-threaded applications. BarrierPoint collects microarchitecture-independent code and data signatures to determine the most representative inter-barrier regions, called barrierpoints. BarrierPoint estimates total application execution time (and other performance metrics of interest) through detailed simulation of these barrierpoints only, leading to substantial simulation speedups. Barrierpoints can be simulated in parallel, use fewer simulation resources, and define fixed units of work to be used in performance comparisons across processor architectures. Our evaluation of BarrierPoint using the NPB and Parsec benchmarks reports average simulation speedups of 24.7x (and up to 866.6x) with an average simulation error of 0.9% (2.9% at most). On average, BarrierPoint reduces the number of simulation machine resources needed by 78x.
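    The selection step described above amounts to ordinary clustering over signature vectors. The sketch below is a minimal, hypothetical rendering of that idea in Python: `signatures`, `work`, and the cluster count are invented stand-ins, not the paper's actual feature set or parameters. It keeps the region nearest each cluster centroid and computes a factor for extrapolating its simulated time to the whole cluster.

        # Hypothetical BarrierPoint-style selection: cluster per-region
        # signatures, keep one representative per cluster, and compute a
        # scaling factor to extrapolate its simulated time to the cluster.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        signatures = rng.random((200, 32))   # one signature per inter-barrier region
        work = rng.integers(1, 100, 200)     # e.g. instructions per region

        km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(signatures)

        barrierpoints, scale = [], []
        for c in range(km.n_clusters):
            members = np.flatnonzero(km.labels_ == c)
            dist = np.linalg.norm(signatures[members] - km.cluster_centers_[c], axis=1)
            rep = members[np.argmin(dist)]
            barrierpoints.append(rep)
            scale.append(work[members].sum() / work[rep])  # cluster work / rep work

        # Detailed simulation runs only `barrierpoints`; total time is then
        # estimated as sum(sim_time[r] * s for r, s in zip(barrierpoints, scale)).

    Because each barrierpoint is an independent region, the detailed simulations can be dispatched in parallel, which is where the resource savings reported above come from.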

    NPS: A Framework for Accurate Program Sampling Using Graph Neural Network

    With the end of Moore's Law, there is a growing demand for rapid architectural innovations in modern processors, such as RISC-V custom extensions, to continue performance scaling. Program sampling is a crucial step in microprocessor design, as it selects representative simulation points for workload simulation. While SimPoint has been the de facto approach for decades, its limited expressiveness with Basic Block Vectors (BBVs) requires time-consuming human tuning, often taking months, which impedes fast innovation and agile hardware development. This paper introduces Neural Program Sampling (NPS), a novel framework that learns execution embeddings using dynamic snapshots of a Graph Neural Network. NPS deploys AssemblyNet for embedding generation, leveraging an application's code structures and runtime states. AssemblyNet serves as NPS's graph model and neural architecture, capturing a program's behavior in aspects such as data computation, code path, and data flow. AssemblyNet is trained with a data-prefetch task that predicts consecutive memory addresses. In the experiments, NPS outperforms SimPoint by up to 63%, reducing the average error by 38%. Additionally, NPS demonstrates strong robustness with increased accuracy, reducing the expensive accuracy-tuning overhead. Furthermore, NPS shows higher accuracy and generality than the state-of-the-art GNN approach in code behavior learning, enabling the generation of high-quality execution embeddings.
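    AssemblyNet itself is a GNN over code structure and runtime state, which is out of scope here, but the self-supervised signal it trains on (predicting upcoming memory addresses) is easy to illustrate. The toy below substitutes a small MLP over recent address deltas for the GNN; the `Encoder` model, the synthetic trace, and all hyperparameters are invented for illustration.

        # Toy stand-in for NPS's data-prefetch training task (not AssemblyNet):
        # learn an embedding whose auxiliary head predicts the next address delta.
        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        window = 8

        # Fake monotone address trace and its deltas (stand-in for a real trace).
        addrs = torch.cumsum(torch.randint(1, 64, (10_000,)), dim=0).float()
        deltas = addrs[1:] - addrs[:-1]

        X = deltas.unfold(0, window, 1)[:-1]   # past `window` deltas
        y = deltas[window:]                    # next delta (the prefetch target)

        class Encoder(nn.Module):
            def __init__(self, d=32):
                super().__init__()
                self.embed = nn.Sequential(nn.Linear(window, 64), nn.ReLU(), nn.Linear(64, d))
                self.head = nn.Linear(d, 1)    # prefetch head, discarded after training
            def forward(self, x):
                z = self.embed(x)              # z plays the role of the execution embedding
                return self.head(z).squeeze(-1), z

        model = Encoder()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(100):
            pred, _ = model(X)
            loss = nn.functional.mse_loss(pred, y)
            opt.zero_grad(); loss.backward(); opt.step()

        # After training, model(X)[1] yields embeddings that can be clustered
        # (e.g. with k-means) to pick simulation points, as SimPoint does with BBVs.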

    A Sampling Method Focusing on Practicality

    In the past few years, several research works have demonstrated that sampling can drastically speed up architecture simulation, and several of these sampling techniques are already widely used. However, for a sampling technique to be both easily and properly used, i.e., plugged into many simulators and used reliably with little or no effort or knowledge from the user, it must fulfill a number of conditions: it should require no hardware-dependent modification of the functional or timing simulator, and it should consider warm-up and sampling jointly, while still delivering high speed and accuracy. The motivation for this article is that, with the advent of generic and modular simulation frameworks like ASIM, SystemC, LSE, MicroLib or UniSim, there is a need for sampling techniques with the aforementioned properties, i.e., which are almost entirely transparent to the user and simulator-agnostic. In this article, we propose a sampling technique focused more on transparency than on speed and accuracy, though the technique delivers almost state-of-the-art performance. Our sampling technique is a hardware-independent and integrated approach to warm-up and sampling; it requires no modification of the functional simulator and relies solely on the performance simulator for warm-up. We make the following contributions: (1) a technique for splitting the execution trace into a potentially very large number of variable-size regions to capture program dynamic control flow, (2) a clustering method capable of efficiently coping with such a large number of regions, (3) a budget-based method for jointly considering warm-up and sampling costs, presenting them as a single parameter to the user, and for distributing the number of simulated instructions between warm-up and sampling based on the region partitioning and clustering information. Overall, the method achieves an accuracy/time tradeoff that is close to the best reported results using clustering-based sampling (though those usually assume perfect or hardware-dependent warm-up), with an average CPI error of 1.68% and an average of 288 million simulated instructions over the SPEC benchmarks. The technique/tool can be readily applied to a wide range of benchmarks, architectures and simulators, and will be used as a sampling option of the UniSim modular simulation framework.
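    Contribution (3), folding warm-up and sampling into one user-visible budget, can be sketched as a simple allocation over the clustered regions. The `allocate` function and its proportional policy below are invented for illustration and are much simpler than the paper's actual distribution algorithm; the 288M figure reuses the abstract's average instruction count as a budget.

        # Invented proportional heuristic: split one instruction budget between
        # warm-up and detailed simulation across clusters of regions.
        def allocate(budget, clusters, warmup_ratio=0.5):
            """clusters: list of (weight, region_size) pairs; weights sum to 1."""
            plan = []
            for weight, size in clusters:
                share = int(budget * weight)        # this cluster's slice of the budget
                warm = int(share * warmup_ratio)    # instructions spent warming state
                detail = min(share - warm, size)    # instructions simulated in detail
                plan.append({"warmup": warm, "detailed": detail})
            return plan

        clusters = [(0.6, 40_000_000), (0.3, 10_000_000), (0.1, 5_000_000)]
        print(allocate(288_000_000, clusters))

    The point of the single-parameter design is that the user only ever chooses `budget`; how it is divided between warming simulator state and measured simulation follows from the region partitioning and clustering.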

    Discovering phase behavior by clustering time-varying microarchitecture-independent features

    Advisor: Rodolfo Jardim de Azevedo. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação. Abstract: Phase analysis has been shown to be an efficient technique for decreasing the time needed to execute detailed microarchitectural simulations. Our study aimed to overcome two limitations of current methods: (i) most approaches adopt a fine-grained strategy, which in some cases can be disturbed by transient noise and does not account for a broader context; and (ii) interpreting the resulting program phases is often difficult, since it is hard to draw meaningful conclusions from high-dimensional phase signatures. Regarding (i), we adopted a two-level phase analysis, each level with a different granularity (level 1: subsequence clustering of multivariate time series; level 2: k-means). We found that, on average, this approach achieved phase-classification accuracy comparable to prior work. We thus reached state-of-the-art precision by an alternative route, with the advantage of offering a potential solution to problem (ii): with the method employed, the phases acquire a much more interpretable signature (MRF) that is closely aligned with the behavior of a program. We demonstrated the effectiveness of this interpretation by using a centrality measure to identify the most important characteristics within a program phase, thereby contributing to the use of these phase signatures (MRF) in later studies.
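    The two-level scheme can be sketched with standard tools: level 1 reduces fixed-length subsequences of the metric time series to coarse summaries (a deliberate simplification of the subsequence-clustering method the dissertation actually uses), and level 2 applies k-means to those summaries. The data, window size, and cluster count below are invented.

        # Simplified two-level phase analysis over a multivariate metric series.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(1)
        series = rng.random((5000, 4))       # 4 microarch-independent metrics over time

        win = 50
        n = series.shape[0] // win
        subseq = series[: n * win].reshape(n, win, 4)

        # Level 1: summarize each subsequence (mean and std per metric).
        level1 = np.concatenate([subseq.mean(1), subseq.std(1)], axis=1)

        # Level 2: k-means over the coarse summaries -> program phases.
        phases = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(level1)
        print(np.bincount(phases))           # time spent in each phase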

    The multi-program performance model: debunking current practice in multi-core simulation

    Composing a representative multi-program multi-core workload is non-trivial. A multi-core processor can execute multiple independent programs concurrently, and hence, any program mix can form a potential multi-program workload. Given the very large number of possible multi-program workloads and the limited speed of current simulation methods, it is impossible to evaluate all possible multi-program workloads. This paper presents the Multi-Program Performance Model (MPPM), a method for quickly estimating multi-program multi-core performance based on single-core simulation runs. MPPM employs an iterative method to model the tight performance entanglement between co-executing programs on a multi-core processor with shared caches. Because MPPM involves analytical modeling, it is very fast, and it estimates multi-core performance for a very large number of multi-program workloads in a reasonable amount of time. In addition, it provides confidence bounds on its performance estimates. Using SPEC CPU2006 and up to 16 cores, we report an average performance prediction error of 2.3% and 2.9% for system throughput (STP) and average normalized turnaround time (ANTT), respectively, while being up to five orders of magnitude faster than detailed simulation. Subsequently, we demonstrate that randomly picking a limited number of multi-program workloads, as done in current practice, can lead to incorrect design decisions in practical design and research studies, which MPPM alleviates. In addition, MPPM can be used to quickly identify multi-program workloads that stress multi-core performance through excessive conflict behavior in shared caches; these stress workloads can then be used to drive the design process further.
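    The iterative core of such a model can be illustrated with a toy fixed-point loop: each program's co-run CPI depends on the cache pressure generated by its co-runners, and that pressure in turn depends on how fast each co-runner is actually progressing. The single-coefficient interference model in `corun_cpi` below is invented for illustration and is far simpler than MPPM's shared-cache model.

        # Toy fixed-point iteration in the spirit of MPPM (interference model invented).
        def corun_cpi(base_cpi, apki, alpha=0.002, iters=1000, tol=1e-9):
            """base_cpi: CPI of each program run alone; apki: cache accesses per
            kilo-instruction; alpha: made-up interference coefficient."""
            cpi = list(base_cpi)
            for _ in range(iters):
                rate = [a / c for a, c in zip(apki, cpi)]   # pressure ~ access rate
                new = [b * (1 + alpha * (sum(rate) - r))    # co-runners' pressure only
                       for b, r in zip(base_cpi, rate)]
                converged = max(abs(n - c) for n, c in zip(new, cpi)) < tol
                cpi = new
                if converged:
                    break
            return cpi

        # Three programs sharing a cache; per-program slowdowns settle at a fixed point.
        print(corun_cpi(base_cpi=[1.0, 1.4, 0.8], apki=[20.0, 5.0, 35.0]))

    Because each iteration is a handful of arithmetic operations per program, sweeping thousands of candidate program mixes this way is cheap, which is what makes exhaustive workload screening feasible.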

    A comparison of cache hierarchies for SMT processors

    In the multithreaded and multicore era, programs are forced to share part of the processor structures. On one hand, the state of the art in multithreading describes how to efficiently manage and distribute inner resources such as the reorder buffer or issue windows. On the other hand, there is a substantial body of work focused on outer resources, mainly on how to effectively share last-level caches in multicores. Between these ends, first- and second-level caches have remained apart, even though they are shared in most commercial multithreaded processors. This work analyzes multiprogrammed workloads as the worst-case scenario for cache sharing among threads. In order to obtain representative results, we present a sampling-based methodology that, for multiple metrics such as STP, ANTT, IPC throughput, and fairness, reduces simulation time by up to 4 orders of magnitude when running 8-thread workloads, with an error lower than 3% and a confidence level of 97%. With this methodology, we compare several state-of-the-art cache hierarchies and observe that Light NUCA provides performance benefits in SMT processors regardless of the organization of the last-level cache. Most importantly, Light NUCA's gains are consistent across the entire range of simulated thread counts, from one to eight.
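    The stopping rule behind a "3% error at 97% confidence" claim can be made concrete: keep adding sampled intervals until the confidence interval on the running mean is within 3% of the mean itself. The IPC distribution below is a stand-in, not the paper's data; the z-value for two-sided 97% confidence is about 2.17.

        # Sequential sampling until the 97% CI half-width is within 3% of the mean.
        import math, random
        from statistics import mean, stdev

        random.seed(0)
        Z97 = 2.17                                  # two-sided 97% normal quantile

        samples = []
        while True:
            samples.append(random.gauss(1.0, 0.2))  # one sampled interval's IPC (fake)
            if len(samples) < 30:                   # need a few samples for a stable stdev
                continue
            half = Z97 * stdev(samples) / math.sqrt(len(samples))
            if half <= 0.03 * mean(samples):        # CI half-width within 3% of the mean
                break

        print(f"{len(samples)} intervals: IPC = {mean(samples):.3f} +/- {half:.3f}")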

    DrPin: a dynamic binary instrumenter for multiple processor architectures

    Advisor: Rodolfo Jardim de Azevedo. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação. Abstract: Programs' complexity is rising, and the tools used in their development have changed to keep up with this evolution. Modern applications rely heavily on dynamically loaded shared libraries, and some even generate code at runtime; static analysis tools used to debug and understand applications are therefore no longer sufficient to obtain the full picture of an application. As a consequence, dynamic analysis tools (those executed at runtime) are being adopted and integrated into the development and study of modern applications. Among those, tools that operate directly on the program binary are particularly useful in the sea of dynamically loaded libraries, where the source code might not be readily available. Building tools that manipulate and instrument binary code at runtime is particularly difficult and error-prone: a minor bug can completely disrupt the behavior of the binary code being analyzed. Because of that, Dynamic Binary Instrumentation (DBI) frameworks have become increasingly popular. These frameworks provide means of building dynamic binary analysis tools with little effort. Among them, Pin 2 has been by far the most popular and easiest to use. However, since the release of the Linux kernel 4 series, it has been unsupported. In this work we study the challenges faced when building a new DBI framework (DrPin) that aims to be compatible with the Pin 2 API, without the restrictions of Pin 3, while also supporting multiple architectures (x86-64, x86, Arm, Aarch64) and modern Linux systems. DrPin currently supports 83 Pin 2 API functions, which makes it capable of running many pintools originally written for Pin 2 without any modification. Comparing the performance of DrPin to the original Pin 2 with a simple tool that counts executed instructions, we observed that on the SPECint 2006 benchmark DrPin is, on average, only 10% slower than Pin 2 and 11.6 times slower than native execution. We also explored the ecosystem around DBI frameworks. Specifically, we studied and extended a technique that uses dynamic binary analysis tools, built with the help of DBI frameworks, to predict the performance of a given architecture when executing a particular program or benchmark without running the entire program or benchmark. In particular, we extended the SimPoint methodology to further reduce the time required to obtain such predictions. We showed that by taking into account similarities in program behavior across different inputs, we can further reduce the time it takes to obtain simulation results for entire benchmarks. Specifically, for SPECint 2006 we showed that the number of SimPoints (which is directly proportional to simulation time) can be reduced by an average of 32% while losing only 0.06% accuracy compared to the original technique. Giving up a further 0.5% accuracy reduces simulation time by an average of 66%.
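    The cross-input extension rests on one observation: different inputs of the same benchmark often produce SimPoints with nearly identical basic-block vectors, so only one of each near-duplicate group needs detailed simulation. Below is a hedged greedy sketch of that idea; the vectors, names, and threshold are invented, and Manhattan distance is used as one common BBV similarity metric, not necessarily the dissertation's choice.

        # Greedy de-duplication of SimPoints across benchmark inputs (illustrative).
        import numpy as np

        def dedupe_simpoints(bbvs, threshold=0.05):
            """bbvs: list of (name, L1-normalized BBV) pairs."""
            kept, mapping = [], {}
            for name, v in bbvs:
                for kept_name, kept_v in kept:
                    if np.abs(v - kept_v).sum() < threshold:
                        mapping[name] = kept_name   # reuse that SimPoint's result
                        break
                else:
                    kept.append((name, v))
                    mapping[name] = name            # must be simulated itself
            return mapping

        rng = np.random.default_rng(2)
        base = rng.random(16); base /= base.sum()
        bbvs = [("input_A.sp1", base),
                ("input_B.sp1", base + rng.normal(0, 0.001, 16)),  # near-duplicate
                ("input_C.sp1", rng.dirichlet(np.ones(16)))]       # genuinely different
        print(dedupe_simpoints(bbvs))

    Every SimPoint mapped onto an already-kept one is a simulation that never has to run, which is where the reported 32% (or, at looser accuracy, 66%) reduction in simulation time comes from.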