
    Energy-Delay Tradeoff Analysis of ILP-based Compilation Techniques on a VLIW Architecture

    Energy consumption has become an important issue on modern processors, especially in embedded systems. While many architectural solutions exist to reduce energy consumption, software solutions mostly rely on performance optimization techniques, on the rationale that energy consumption is roughly proportional to execution time. Some ILP techniques increase performance by eliminating redundant instructions, but others increase the total instruction count, mitigating their benefit as far as energy consumption is concerned. This paper explores the energy-delay tradeoff of ILP-enhancing techniques at the compilation level. The goal is to develop a theoretical understanding of the main energy issues involved in forming ILP blocks. We present an analytical methodology that exploits variations in program performance to identify the conditions that lead to increased energy consumption. Our results show that there exists a threshold above which ILP-enhancing optimizations yield diminishing energy-reduction returns. The proposed tradeoff analysis reveals that this is mainly attributable to the limited instruction parallelism available in applications: when ILP is pushed above a given threshold, wasted computation and, to some extent, machine overhead start to dominate energy consumption.
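
The threshold effect described above can be illustrated with a toy model (a sketch of our own; the constants and the form of the model are assumptions, not the paper's actual methodology): past an application's intrinsic ILP, delay stops improving while wasted issue slots keep adding energy.

```python
# Toy energy-delay model for a VLIW machine (illustrative only).
# Energy ~ useful work + wasted issue slots + static power over the run time.

def energy_delay(issue_width, app_ilp, n_ops=1_000_000,
                 e_op=1.0, e_slot=0.2, p_static=2.0):
    """Estimate (energy, delay in cycles) for a given issue width
    running an application whose usable parallelism is app_ilp."""
    used = min(issue_width, app_ilp)        # slots doing useful work per cycle
    cycles = n_ops / used                   # delay shrinks with exploited ILP
    wasted = (issue_width - used) * cycles  # empty slots still cost energy
    energy = n_ops * e_op + wasted * e_slot + cycles * p_static
    return energy, cycles

# Beyond app_ilp, widening the machine no longer reduces delay
# but keeps increasing energy.
for width in (1, 2, 4, 8):
    e, d = energy_delay(width, app_ilp=3)
    print(width, round(e), round(d))
```

Under this model, widths 4 and 8 produce identical delay for an application with ILP of 3, but the 8-wide configuration burns strictly more energy in empty slots.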

    A Case for a Complexity-Effective, Width-partitioned Microarchitecture

    Current superscalar processors feature 64-bit datapaths to execute program instructions, regardless of operand size. Our analysis indicates, however, that most executions comprise a large fraction (40%) of narrow-width operations, i.e., instructions that exclusively process narrow-width operands and results. We further noticed that these operations are well distributed across a program run. In this paper, we exploit these properties to master the hardware complexity of superscalar processors. We propose a width-partitioned microarchitecture (WPM) to decouple the treatment of narrow-width operations from that of the other program instructions. We split a 4-way issue processor into two clusters: one executing 64-bit, load/store, and complex operations, and the other treating 16-bit operations. We show that revealing the narrow-width operations to the hardware is sufficient to keep the workload balanced and the inter-cluster communications minimized. Using a WPM reduces the complexity of several critical processor components, such as the register file and the bypass network. A WPM also lowers the complexity of the interconnection fabric, since the 16-bit cluster can only propagate narrow-width data. We examine simple WPM configurations and discuss their tradeoffs, and we evaluate a speculative heuristic to steer narrow-width operations toward the clusters. A detailed complexity analysis shows that the WPM model saves power and area with minimal impact on performance.
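
The classification underlying the steering decision can be sketched as follows (a hypothetical illustration; the paper's actual heuristic is speculative and implemented in hardware):

```python
# Sketch of a narrow-width steering check: an instruction may go to the
# 16-bit cluster only when every operand and result fits in 16 bits.
# The opcode names and cluster labels are invented for illustration.

def is_narrow(value, width=16):
    """True if a signed integer fits in `width` bits."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return lo <= value <= hi

def steer(op, operands, results):
    """Route loads/stores and complex operations to the 64-bit cluster;
    route an instruction to the 16-bit cluster only when all of its
    operands and results are narrow."""
    if op in {"load", "store", "mul64", "div"}:
        return "cluster64"
    if all(is_narrow(v) for v in operands + results):
        return "cluster16"
    return "cluster64"

print(steer("add", [1000, -2000], [-1000]))         # cluster16
print(steer("add", [1 << 40, 1], [(1 << 40) + 1]))  # cluster64
```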

    LASER: Light, Accurate Sharing dEtection and Repair

    Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior, and contention can even arise invisibly due to the opaque placement decisions of the memory allocator. Previous schemes have focused only on false sharing, and either impose significant performance penalties or require non-trivial alterations to the operating system or runtime environment. This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities of Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler, or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec, and Splash2x benchmarks, and can automatically improve program performance by up to 19% on commodity hardware.
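
A classic repair for false sharing is to pad per-thread data so that each slot occupies its own cache line. The sketch below only illustrates that layout idea (LASER performs its repairs online at runtime; the sizes and layout here are conventional assumptions, not LASER's mechanism):

```python
# Lay out one 64-bit counter per thread, one cache line apart, so that
# concurrent updates never contend on the same line.
import array

CACHE_LINE = 64           # bytes, typical x86 line size
SLOT = 8                  # one 64-bit counter per slot

def padded_index(thread_id):
    """Index into a counter array laid out one cache line per thread."""
    stride = CACHE_LINE // SLOT   # 8 slots per line; use only the first
    return thread_id * stride

counters = array.array("q", [0] * (4 * (CACHE_LINE // SLOT)))
for t in range(4):
    counters[padded_index(t)] += 1

# Threads 0..3 update offsets 0, 8, 16, 24 -> distinct cache lines.
print([padded_index(t) for t in range(4)])  # [0, 8, 16, 24]
```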

    MPreplay: Architecture Support for Deterministic Replay of Message Passing Programs on Message Passing Many-Core Processors

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory.

    Randomness Concerns When Deploying Differential Privacy

    The U.S. Census Bureau is using differential privacy (DP) to protect confidential respondent data collected for the 2020 Decennial Census of Population & Housing. The Census Bureau's DP system is implemented in the Disclosure Avoidance System (DAS) and requires a source of random numbers. We estimate that the 2020 Census will require roughly 90TB of random bytes to protect the person and household tables. Although there are critical differences between cryptography and DP, they have similar requirements for randomness. We review the history of random number generation on deterministic computers, including von Neumann's "middle-square" method, the Mersenne Twister (MT19937) (previously the default NumPy random number generator, which we conclude is unacceptable for use in production privacy-preserving systems), and the Linux /dev/urandom device. We also review hardware random number generator schemes, including the use of so-called "Lava Lamps" and the Intel Secure Key RDRAND instruction. Finally, we present our plan for generating random bits in the Amazon Web Services (AWS) environment using AES-CTR-DRBG seeded by mixing bits from /dev/urandom and the Intel Secure Key RDSEED instruction, a compromise between our desire to rely on a trusted hardware implementation, our external reviewers' unease at trusting a hardware-only implementation, and the need to generate so many random bits.
    Comment: 12 pages plus 2 pages of bibliography
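
The seeding step described above, mixing two independent entropy sources, can be sketched as follows. This is an illustration of the mixing idea only: the DAS uses AES-CTR-DRBG with RDSEED, neither of which is reproduced here; `os.urandom` stands in for both sources and SHA-512 for the mixing function.

```python
# Derive a DRBG seed by hashing the concatenation of two independent
# entropy sources, so the result is strong if EITHER source is strong.
import hashlib
import os

def mixed_seed(nbytes=48):
    """Return `nbytes` of seed material mixed from two sources."""
    source_a = os.urandom(nbytes)   # kernel entropy (/dev/urandom)
    source_b = os.urandom(nbytes)   # placeholder for RDSEED output
    digest = hashlib.sha512(source_a + source_b).digest()
    return digest[:nbytes]

seed = mixed_seed()
print(len(seed))  # 48
```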

    Energy-efficiency potential of a phase-based cache resizing scheme for embedded systems

    Compilation techniques for managing and optimizing the energy consumption of VLIW architectures

    This thesis focuses on reducing the energy consumption of VLIW architectures while preserving as much performance as possible. Unlike software approaches that favor code optimization to obtain energy gains, we argue for a synergistic approach integrating hardware and software. The central idea defended throughout this thesis is that only an advanced understanding of a program's dynamic behavior at the compiler level can yield better control over energy management. To this end, we introduce a static analysis technique for a program's dynamic behavior, aimed at identifying and characterizing the program's most frequently executed paths. Since the objective is to reduce energy consumption, we then show that, under compiler directives, the machine architecture can be modified to adapt to a particular dynamic state of the program. We present the conditions for such a reconfiguration, as well as the modifications to be made to the architecture, both for the cache system and for the datapath of a VLIW processor.
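
The hot-path identification step can be sketched as a greedy walk over a control-flow graph annotated with execution counts (a toy of our own: the thesis derives this information statically in the compiler, whereas this sketch assumes profile counts are already available).

```python
# Pick a program's hottest path by repeatedly following the most
# frequently executed successor block. Block names and counts are invented.

def hottest_path(cfg, counts, entry):
    """Follow the most frequently executed successor from `entry`
    until reaching a block with no successors or closing a cycle."""
    path, seen = [entry], {entry}
    node = entry
    while cfg.get(node):
        node = max(cfg[node], key=lambda b: counts.get(b, 0))
        if node in seen:
            break
        path.append(node)
        seen.add(node)
    return path

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
counts = {"A": 100, "B": 95, "C": 5, "D": 100}
print(hottest_path(cfg, counts, "A"))  # ['A', 'B', 'D']
```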

    Energy reduction potential of a phase-based cache resizing scheme for embedded systems

    Managing the energy-performance tradeoff has become a major challenge in embedded systems, and the cache hierarchy is a typical design target where this tradeoff plays a central role. With increasing integration density, a cache can comprise several billion transistors and consume a large proportion of the energy; at the same time, it considerably improves performance. Configurable caches are becoming the de facto solution for dealing with this problem efficiently: such caches are equipped with mechanisms that allow them to be resized dynamically. In the context of embedded systems, however, many of these mechanisms restrict configurability to the application level. In this paper, we propose to modify the structure of a configurable cache to give embedded compilers the opportunity to reconfigure it according to a program's dynamic phases, rather than on a per-application basis. Our experimental results show that the proposed scheme has the potential to improve the compiler's effectiveness at reducing energy consumption without excessively degrading performance.
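
A per-phase resizing decision might look like the following toy policy (the configuration sizes, slack factor, and selection rule are invented for illustration and are not the paper's scheme):

```python
# Choose the smallest cache configuration that holds each phase's
# working set, with some slack to limit conflict misses.

SIZES_KB = [4, 8, 16, 32]   # available cache configurations

def pick_size(working_set_kb, slack=1.25):
    """Smallest configuration >= working set * slack, else the largest."""
    need = working_set_kb * slack
    for size in SIZES_KB:
        if size >= need:
            return size
    return SIZES_KB[-1]

# Per-phase working sets (KB) -> per-phase cache configurations.
phases = [3, 12, 30, 6]
print([pick_size(ws) for ws in phases])  # [4, 16, 32, 8]
```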

    BugNet: Continuously recording program execution for deterministic replay debugging

    Significant time is spent by companies trying to reproduce and fix the bugs that occur in released code. To assist developers, we propose the BugNet architecture to continuously record information during production runs. The information collected before a crash can be used by developers, working in their own execution environment, to deterministically replay the last several million instructions executed before the crash. BugNet is based on the insight that recording the register file contents at any point in time, and then recording the load values that occur after that point, enables deterministic replay of a program's execution. BugNet focuses on replaying the application's execution and the libraries it uses, but not the operating system; nevertheless, it can replay an application's execution across context switches and interrupts. BugNet thus obviates the need to track program I/O, interrupts, and DMA transfers, which would otherwise require more complex hardware support. In addition, BugNet does not require a final core dump of system state for replay, which significantly reduces the amount of data that must be sent back to the developer.
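
The core insight, checkpoint the registers and log every load value, and replay needs nothing else, can be demonstrated with a tiny register machine (the instruction set and log format here are invented; BugNet does this in hardware at full speed):

```python
# Record mode logs every loaded value; replay mode consumes the log
# instead of touching memory, making re-execution deterministic.

def run(program, memory, log=None, replay=None):
    """Execute a toy two-register machine, recording or replaying loads."""
    regs = {"r0": 0, "r1": 0}
    for op, dst, src in program:
        if op == "load":
            val = replay.pop(0) if replay is not None else memory[src]
            if log is not None:
                log.append(val)
            regs[dst] = val
        elif op == "add":
            regs[dst] += regs[src]
    return regs

prog = [("load", "r0", 0), ("load", "r1", 1), ("add", "r0", "r1")]
log = []
first = run(prog, memory={0: 5, 1: 7}, log=log)
# Replay later, with no access to the original memory:
again = run(prog, memory=None, replay=list(log))
print(first == again)  # True
```

Note that the replay run never reads `memory` at all: everything the program observed is captured in the load-value log, which is why I/O and DMA need no separate tracking.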