
    Software directed issue queue power reduction

    The issue logic of a superscalar processor dissipates a large amount of static and dynamic power. Furthermore, its power density makes it a hot spot that requires expensive cooling systems and additional packaging. In this paper we present a novel software-assisted approach to power reduction in which the processor dynamically resizes the issue queue based on compiler analysis. The compiler passes information to the processor about the number of entries needed, which limits the number of instructions dispatched and resident in the queue. This saves power without adversely affecting performance. Compared with recently proposed hardware techniques, our approach is faster, simpler and saves more power. Using a simplistic scheme we achieve 47% dynamic and 31% static power savings in the issue queue with only a 2.2% performance loss. We then show that the performance loss can be reduced to less than 1.3% with 45% dynamic and 30% static power savings, outperforming all current approaches.
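
    As a rough illustration of the general technique (not the mechanism evaluated in the paper), the sketch below models an issue queue whose active size is capped by a hypothetical per-region hint passed down by the compiler; the region names, hint values and the gating proxy are illustrative assumptions.

        # Sketch of compiler-directed issue-queue resizing (illustrative only).
        QUEUE_CAPACITY = 64          # physical issue-queue entries
        region_hints = {             # hypothetical compiler-provided hints
            "memcpy_loop": 16,
            "fp_kernel": 48,
            "pointer_chase": 8,
        }

        class IssueQueue:
            def __init__(self, capacity):
                self.capacity = capacity
                self.limit = capacity        # currently active entries
                self.entries = []

            def resize(self, hint):
                # Clamp the compiler hint to the physical capacity; entries
                # above the limit are assumed to be clock/power gated.
                self.limit = min(hint, self.capacity)

            def dispatch(self, insn):
                # Dispatch stalls instead of growing past the software-set limit.
                if len(self.entries) >= self.limit:
                    return False
                self.entries.append(insn)
                return True

            def gated_fraction(self):
                # Rough proxy for static-power savings.
                return 1.0 - self.limit / self.capacity

        iq = IssueQueue(QUEUE_CAPACITY)
        for region, hint in region_hints.items():
            iq.resize(hint)
            print(f"{region}: limit={iq.limit}, ~{iq.gated_fraction():.0%} entries gated")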

    Adaptive runtime-assisted block prefetching on chip-multiprocessors

    Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well-studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software-controlled prefetching schemes we find a wide variety, including runtime-directed prefetching and, more specifically, runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software-driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low-cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speedup of 32% and an average of 10% compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction in energy-to-solution of up to 18%, and of 3% on average.
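
    The sketch below is a minimal, assumption-laden model of the split the abstract describes: a runtime asks a simple block-prefetch engine (modelled here as a plain function) to bring a task's input blocks on-chip and picks the destination cache level by block size, while fine-grained locality would still be left to the ordinary hardware prefetchers. The cache sizes, policy thresholds and function names are not taken from the paper.

        # Runtime-assisted block prefetching, illustrative model only.
        L2_BYTES = 256 * 1024
        L3_BYTES = 8 * 1024 * 1024

        def prefetch_block(base_addr, size, level):
            # Stand-in for a low-cost hardware block-prefetch engine.
            print(f"prefetch {size >> 10} KiB at {hex(base_addr)} into {level}")

        def runtime_prefetch(task_inputs):
            """Choose a destination cache for each input block of the next task."""
            for base_addr, size in task_inputs:
                if size <= L2_BYTES // 2:
                    prefetch_block(base_addr, size, "L2")   # small block: keep it close
                elif size <= L3_BYTES // 2:
                    prefetch_block(base_addr, size, "L3")   # larger block: shared cache
                # Blocks larger than half of L3 are skipped to avoid thrashing.

        # Hypothetical (address, size-in-bytes) input blocks of the next task.
        runtime_prefetch([(0x10000000, 64 * 1024), (0x20000000, 2 * 1024 * 1024)])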

    Late allocation and early release of physical registers

    The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and release are done conservatively: the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once the registers are no longer used. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers holding a value that will be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where register file pressure is usually high.
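
    As a toy model of the late-allocation/early-release idea (not VP-LAER itself), the sketch below gives an instruction only a virtual tag at rename, takes a physical register from the free list when the value is actually produced, and returns it once the last consumer has read it; the reader counts and data structures are assumptions, and the redefinition and recovery checks a real design needs are omitted.

        # Late allocation and early release of physical registers, toy model.
        free_list = list(range(8))        # physical register identifiers
        virt_map = {}                     # virtual tag -> {"phys": ..., "readers": ...}

        def rename(virtual_tag, num_readers):
            # Late allocation: no physical register is consumed at rename time.
            virt_map[virtual_tag] = {"phys": None, "readers": num_readers}

        def writeback(virtual_tag):
            # The physical register is allocated only when the value is produced.
            entry = virt_map[virtual_tag]
            entry["phys"] = free_list.pop()
            return entry["phys"]

        def read(virtual_tag):
            # Early release: free the register as soon as its last reader is done.
            entry = virt_map[virtual_tag]
            entry["readers"] -= 1
            if entry["readers"] == 0:
                free_list.append(entry["phys"])
                del virt_map[virtual_tag]

        rename("v1", num_readers=2)
        writeback("v1")
        read("v1"); read("v1")
        print("free registers:", len(free_list))   # back to 8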

    Type-Directed Program Transformations for the Working Functional Programmer

    We present preliminary research on Deuce+, a set of tools integrating plain-text editing with structural manipulation that brings the power of expressive and extensible type-directed program transformations to everyday working programmers without a background in computer science or mathematical theory. Deuce+ comprises three components: (i) a novel set of type-directed program transformations; (ii) support for syntax constraints for specifying "code style sheets" as a means of flexibly ensuring the consistency of both the concrete and abstract syntax of the output of program transformations; and (iii) a domain-specific language for specifying program transformations that can operate at a high level on the abstract (and/or concrete) syntax tree of a program and interface with syntax constraints to expose end-user options and alleviate tedious and potentially mutually inconsistent style choices. Currently, Deuce+ is in the design phase of development, and discovering the right usability choices for the system is of the highest priority.

    European ALMA operations: the interaction with and support to the users

    The Atacama Large Millimetre/submillimetre Array (ALMA) is one of the largest and most complicated observatories ever built. Constructing and operating an observatory at high altitude (5000 m) in a cost-effective and safe manner, with minimal effect on the environment, creates interesting challenges. Since the array will have to adapt quickly to prevailing weather conditions, ALMA will be operated exclusively in service mode. By the time of full science operations, the fundamental ALMA data products will be calibrated, deconvolved data cubes and images, but raw data and data-reduction software will be made available to users as well. User support is provided by the ALMA Regional Centres (ARCs) located in Europe, North America and Japan. These ARCs constitute the interface between the user community and the ALMA observatory in Chile. For European users, the European ARC is being set up as a cluster of nodes located throughout Europe, with the main centre at the ESO Headquarters in Garching. The main centre serves as the access portal and, in synergy with the distributed network of ARC nodes, its main aim is to optimize the ALMA science output and to fully exploit this unique and powerful facility. This article introduces the process of proposing for observing time, the subsequent execution of the observations, and the obtaining and processing of the data in the ALMA era. The complete end-to-end ALMA data flow, from proposal submission to data delivery, is described.

    Hardware schemes for early register release

    Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, a register is conservatively released only after the instruction that redefines it has committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early-release hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.

    Architectural support for task dependence management with flexible software scheduling

    The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents the Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted the Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence-tracking operations to the DMU while still performing task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x lower area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.
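
    The sketch below is a small software model, under stated assumptions, of the division of labour the abstract describes: dependence tracking lives in a DMU-like component, while the choice of which ready task to run next stays in a pluggable software scheduler. The class, its interface and the FIFO policy are illustrative rather than TDM's actual ISA or hardware, and the model assumes all tasks are submitted before any of them executes.

        # Dependence tracking in a "hardware" unit, scheduling in software (toy model).
        from collections import defaultdict, deque

        class DependenceUnit:
            """Tracks data dependences between tasks and exposes ready ones."""
            def __init__(self):
                self.last_writer = {}                  # address -> producing task
                self.pending = defaultdict(int)        # task -> unmet dependences
                self.waiters = defaultdict(list)       # producer -> dependent tasks
                self.ready = deque()

            def submit(self, task, inputs, outputs):
                for addr in inputs:
                    producer = self.last_writer.get(addr)
                    if producer is not None:
                        self.pending[task] += 1
                        self.waiters[producer].append(task)
                for addr in outputs:
                    self.last_writer[addr] = task
                if self.pending[task] == 0:
                    self.ready.append(task)

            def complete(self, task):
                for dep in self.waiters.pop(task, []):
                    self.pending[dep] -= 1
                    if self.pending[dep] == 0:
                        self.ready.append(dep)

        def fifo_scheduler(ready_queue):
            # Software policy; could be swapped for priority- or locality-aware ones.
            return ready_queue.popleft()

        dmu = DependenceUnit()
        dmu.submit("produce_A", inputs=[], outputs=["A"])
        dmu.submit("consume_A", inputs=["A"], outputs=["B"])
        while dmu.ready:
            task = fifo_scheduler(dmu.ready)
            print("run", task)
            dmu.complete(task)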