    Génération dynamique de code pour l'optimisation énergétique [Dynamic code generation for energy optimization]

    In computing systems, energy consumption is limiting the performance growth experienced in recent decades. Consequently, computer architecture and software development paradigms will have to change if we want to avoid performance stagnation in the coming decades.

    In this new scenario, new architectural and micro-architectural designs can increase the energy efficiency of hardware through specialization, such as heterogeneous configurations of cores, new computing units and accelerators. On the other hand, software development must cope with the lack of performance portability across ever-changing hardware and with the growing gap between the performance that programmers can extract and the maximum achievable performance of the hardware. To address this issue, this thesis contributes a methodology and proof of concept of a run-time auto-tuning framework for embedded systems. The proposed framework can both adapt code to a micro-architecture unknown prior to compilation and explore auto-tuning possibilities that are input-dependent.

    To study the capability of the proposed approach to adapt code to different micro-architectural configurations, I developed a simulation framework of heterogeneous in-order and out-of-order ARM cores, based on the gem5 and McPAT simulators. Validation experiments demonstrated average absolute timing errors of around 7% when compared to real ARM Cortex-A8 and A9 cores, and relative energy/performance estimations within 6% for the Dhrystone 2.1 benchmark when compared to Cortex-A7 and A15 (big.LITTLE) CPUs. The timing validation results show that gem5 is considerably more accurate than similar existing simulators, whose average errors exceed 15%.

    An important component of the run-time auto-tuning framework is a run-time code generation tool called deGoal. It defines a low-level dynamic DSL for computing kernels. During this thesis, I ported deGoal to the ARM Thumb-2 ISA and added new features for run-time auto-tuning. A preliminary validation on ARM processors showed that deGoal can on average generate machine code of equivalent or higher quality than reference programs written in C, including manually vectorized code.

    The methodology and proof of concept of run-time auto-tuning in embedded processors were developed around two kernel-based applications, extracted from the PARSEC 3.0 suite and its hand-vectorized version PARVEC. In the favorable application, average speedups of 1.26 and 1.38 were obtained on real and simulated cores, respectively, going up to 1.79 and 2.53 (all run-time overheads included). I also demonstrated through simulation that run-time auto-tuning of SIMD instructions for in-order cores can outperform the reference vectorized code run on similar out-of-order cores, with an average speedup of 1.03 and an energy efficiency improvement of 39%. The unfavorable application was chosen to show that the proposed approach has negligible overhead when better kernel versions cannot be found. When both applications run on real hardware, the run-time auto-tuning performance is on average only 6% below the performance obtained by the best statically found kernel implementations.
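The run-time auto-tuning idea summarized in this abstract can be sketched in a few lines. The following is a minimal illustration in Python and not the thesis framework: `make_dot_kernel` is a hypothetical stand-in for a run-time code generator such as deGoal (here it only specializes a closure by an assumed unroll factor), and `RuntimeAutoTuner`, its `timer` parameter and the search space are names invented for this example. It shows one possible control flow: measure the reference kernel, try one candidate per call while the application keeps running, and keep whichever version was fastest.

```python
import time


def make_dot_kernel(unroll):
    """Hypothetical stand-in for run-time code generation: build a dot
    product specialized for one unroll factor."""

    def kernel(a, b):
        n = len(a)
        limit = n - n % unroll           # largest multiple of `unroll` <= n
        total = 0.0
        for i in range(0, limit, unroll):
            for j in range(unroll):      # body repeated `unroll` times
                total += a[i + j] * b[i + j]
        for k in range(limit, n):        # leftover elements
            total += a[k] * b[k]
        return total

    kernel.unroll = unroll               # tag the variant for inspection
    return kernel


class RuntimeAutoTuner:
    """Explores kernel variants during normal calls (assumed design)."""

    def __init__(self, reference, generate, search_space,
                 timer=time.perf_counter):
        self.best = reference            # currently active kernel
        self.best_cost = None            # measured on the first call
        self.generate = generate         # parameter -> new kernel variant
        self.pending = list(search_space)
        self.timer = timer

    def _timed(self, kernel, *args):
        start = self.timer()
        result = kernel(*args)
        return result, self.timer() - start

    def __call__(self, *args):
        if self.best_cost is None:       # first call: baseline measurement
            result, self.best_cost = self._timed(self.best, *args)
            return result
        if self.pending:                 # exploration: one candidate per call
            candidate = self.generate(self.pending.pop(0))
            result, cost = self._timed(candidate, *args)
            if cost < self.best_cost:    # keep only strictly faster variants
                self.best, self.best_cost = candidate, cost
            return result
        return self.best(*args)          # steady state: no tuning overhead


dot = RuntimeAutoTuner(make_dot_kernel(1), make_dot_kernel, [2, 4, 8])
```

Every call returns a valid result, so exploration happens during useful work; if no candidate beats the reference, the reference stays active and the only cost is the handful of calls spent measuring. Timings of Python closures are noisy, so the sketch illustrates the control flow only, not the speedups reported in the abstract.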

    Compilers for leakage power reduction

    Leakage power constitutes an increasing fraction of total power consumption in modern semiconductor technologies. Recent research indicates that architectures, compilers, and software can be optimized to reduce the switching power (also known as dynamic power) of microprocessors. This has led to interest in using architecture and compiler optimization to reduce leakage power (also known as static power) as well. In this paper, we investigate compiler-analysis techniques for reducing leakage power. The architecture model in our design is a system whose instruction set supports control of power gating at the component level. Our compiler provides an analysis framework for using these instructions to reduce leakage power. We present a data-flow analysis framework that estimates component activity at fixed points of programs while taking pipeline architectures into account. We also provide equations that the compiler can use to determine whether employing power-gating instructions in given program blocks will reduce the total energy requirements. Because the duration of power gating on components during execution of a given program routine depends on the number and complexity of program branches, we propose a set of scheduling policies and evaluate their effectiveness. We performed experiments by incorporating our compiler analysis and scheduling policies into the SUIF compiler tools and by simulating energy consumption with the Wattch toolkit. The experimental results demonstrate that our mechanisms are effective in reducing leakage power in microprocessors.
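The abstract does not state the equations the compiler uses, so the following is only a first-order sketch of the kind of break-even test it describes, written in Python under assumed parameters. The `Component` fields (`leak_per_cycle`, `gated_leak_per_cycle`, `switch_energy`, `wakeup_cycles`) and both function names are hypothetical. The assumed model is that gating a component pays off when the leakage saved over the usable idle interval exceeds the energy spent switching it off and back on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Illustrative per-component parameters (all values are assumptions)."""
    name: str
    leak_per_cycle: float        # leakage energy per cycle when powered on
    gated_leak_per_cycle: float  # residual leakage per cycle when gated
    switch_energy: float         # energy of one power-off/power-on pair
    wakeup_cycles: int           # cycles needed before the unit is usable


def break_even_cycles(c: Component) -> float:
    """Idle length at which gating neither saves nor costs energy."""
    saved_per_cycle = c.leak_per_cycle - c.gated_leak_per_cycle
    return c.wakeup_cycles + c.switch_energy / saved_per_cycle


def gating_saves_energy(c: Component, idle_cycles: int) -> bool:
    """True if gating `c` over `idle_cycles` idle cycles reduces energy."""
    usable = idle_cycles - c.wakeup_cycles   # wake-up must start early
    if usable <= 0:
        return False
    saved = (c.leak_per_cycle - c.gated_leak_per_cycle) * usable
    return saved > c.switch_energy


# Made-up numbers for a hypothetical floating-point multiplier.
fpu = Component("fp_multiplier", leak_per_cycle=1.0,
                gated_leak_per_cycle=0.25, switch_energy=30.0,
                wakeup_cycles=4)
```

With these made-up numbers, `break_even_cycles(fpu)` is 44 cycles, so under this model a compiler would insert power-gating instructions only around regions where its data-flow analysis shows the unit idle for longer than that.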