166 research outputs found

    Optimizing energy-efficiency for multi-core packet processing systems in a compiler framework

    Network applications are becoming increasingly computation-intensive, and traffic volumes are growing at an unprecedented rate. Multi-core and multi-threaded techniques are therefore widely employed in packet processing systems to meet these changing requirements. However, the processing power cannot be fully utilized without a suitable programming environment. The compilation procedure is decisive for code quality and can largely determine overall system performance in terms of packet throughput, individual packet latency, core utilization and energy efficiency. The thesis first investigates compilation issues in the networking domain, particularly energy consumption. As a cornerstone for any compiler optimization, a code analysis module that collects program dependencies is presented and incorporated into a compiler framework. Using that dependency information, a strategy based on graph bi-partitioning and mapping is proposed to search for an optimal configuration in a parallel-pipeline fashion. The energy-aware extension is specifically effective in enhancing the energy efficiency of the whole system. Finally, a generic evaluation framework for simulating the performance and energy consumption of a packet processing system is given. It accepts flexible architectural configurations and can perform arbitrary code mappings. Its simulation time is extremely short compared with full-fledged simulators. A set of our optimization results is gathered using the framework.
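
The bi-partitioning step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual algorithm, so the code below swaps in an exhaustive search over balanced two-way splits of a tiny task graph. The function names, stage names and edge weights are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def cut_weight(edges, part_a):
    """Sum the weights of edges whose endpoints fall in different parts."""
    return sum(w for (u, v), w in edges.items() if (u in part_a) != (v in part_a))


def best_bipartition(nodes, edges):
    """Exhaustively search balanced two-way splits for the lowest cut weight.

    This is only practical for small graphs; it stands in for whatever
    heuristic a real compiler framework would use.
    """
    nodes = sorted(nodes)
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to split")
    half = len(nodes) // 2
    first = nodes[0]
    best = None
    for combo in combinations(nodes[1:], half - 1):
        # Pin the first node to part A so mirror-image splits are not revisited.
        part_a = frozenset((first,) + combo)
        cost = cut_weight(edges, part_a)
        if best is None or cost < best[0]:
            best = (cost, part_a)
    cost, part_a = best
    return cost, set(part_a), set(nodes) - part_a


# Hypothetical packet-processing task graph: edge weight = data exchanged.
tasks = ["parse", "classify", "lookup", "forward"]
deps = {
    ("parse", "classify"): 5,
    ("classify", "lookup"): 1,
    ("lookup", "forward"): 4,
    ("parse", "lookup"): 1,
}
cost, stage_a, stage_b = best_bipartition(tasks, deps)
```

Each resulting part could then be mapped to one pipeline stage (or one core), with the cut weight approximating the inter-stage communication that the split incurs.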

    Innovative Techniques for Testing and Diagnosing SoCs

    We rely upon the continued functioning of many electronic devices for our everyday welfare. These devices usually embed integrated circuits that keep becoming cheaper and smaller while gaining features. Nowadays, microelectronics can integrate a working computer with CPU, memories, and even GPUs on a single die, known as a System-on-Chip (SoC). SoCs are also employed in automotive safety-critical applications, where they need to be tested thoroughly to comply with reliability standards, in particular the ISO 26262 functional safety standard for road vehicles. The goal of this PhD thesis is to improve SoC reliability by proposing innovative techniques for testing and diagnosing its internal modules: CPUs, memories, peripherals, and GPUs. The proposed approaches, in the order in which they appear in the thesis, are as follows: 1. Embedded Memory Diagnosis: Memories are dense and complex circuits that are susceptible to design and manufacturing errors. Hence, it is important to understand fault occurrence in the memory array. In practice, the logical and physical array representations differ because of design optimizations, known as scrambling. This part proposes an accurate memory diagnosis based on a software tool able to analyze test results, unscramble the memory array, map failing syndromes to cell locations, perform cumulative analysis, and formulate a final fault model hypothesis. Several SRAM failing syndromes, gathered on an industrial automotive 32-bit SoC developed by STMicroelectronics, were analyzed as case studies. The tool displayed defects virtually, and the results were confirmed by photographs taken with a microscope. 2. Functional Test Pattern Generation: The key to a successful test is the pattern applied to the device. Patterns can be structural or functional; the former usually benefit from embedded test modules targeting manufacturing errors and are only effective before the component is shipped to the client. 
The latter, on the other hand, can be applied during mission with minimal impact on performance, but are penalized by high generation time. However, functional test patterns can serve different goals in functional mission mode. Part III of this PhD thesis proposes three functional test pattern generation methods for CPU cores embedded in SoCs, each targeting a different test purpose: a. Functional Stress Patterns: suitable for optimizing functional stress during Operational-life Tests and Burn-in Screening for an optimal device reliability characterization. b. Functional Power Hungry Patterns: suitable for determining functional peak power in order to strictly limit the power of structural patterns during manufacturing tests, thus reducing premature device over-kill while delivering high test coverage. c. Software-Based Self-Test (SBST) Patterns: combine the potential of structural patterns with that of functional ones, allowing periodic execution during mission. In addition, external hardware communicating with a devised SBST was proposed. It helps increase fault coverage by 3% by testing critical Hardly Functionally Testable Faults not covered by conventional SBST patterns. An automatic functional test pattern generation flow, exploiting an evolutionary algorithm that maximizes metrics related to stress, power, and fault coverage, was employed in the above-mentioned approaches to quickly generate the desired patterns. The approaches were evaluated on two industrial cases developed by STMicroelectronics: an 8051-based SoC and a 32-bit Power Architecture SoC. Results show that generation time was reduced by up to 75% in comparison with older methodologies while significantly increasing the desired metrics. 3. Fault Injection in GPGPU: Fault injection mechanisms in semiconductor devices are suitable for generating structural patterns, testing and activating mitigation techniques, and validating robust hardware and software applications. 
GPGPUs are known for fast parallel computation and are used in high-performance computing and advanced driver assistance, where reliability is key. Moreover, GPGPU manufacturers do not provide design description code, for reasons of confidentiality. Commercial fault injectors that rely on a GPGPU model are therefore infeasible, leaving radiation tests, which are costly, as the only available resource. In the last part of this thesis, we propose a software-implemented fault injector able to inject bit-flips into memory elements of a real GPGPU. It exploits a software debugger tool combined with the C-CUDA grammar to determine fault locations and apply bit-flip operations to program variables. The goal is to validate robust parallel algorithms by studying fault propagation or by activating the redundancy mechanisms they may embed. The effectiveness of the tool was evaluated on two robust applications: redundant parallel matrix multiplication and floating-point Fast Fourier Transform.
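
The bit-flip idea in the last part of the abstract above can be illustrated on the CPU with a minimal sketch. The thesis injects faults into a real GPGPU through a debugger; the code below is not that tool. It only assumes a float32 variable and a duplicated (redundant) matrix multiplication, and all function names are hypothetical.

```python
import struct


def flip_bit(value, bit):
    """Return `value`, stored as an IEEE-754 float32, with one bit inverted."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def redundant_matmul(a, b, inject=None):
    """Compute a @ b twice and report whether the two copies disagree.

    `inject` is an optional (row, column, bit) triple: the selected element
    of `a` is corrupted in the first copy only, emulating a transient fault.
    """
    a_faulty = [row[:] for row in a]
    if inject is not None:
        i, j, bit = inject
        a_faulty[i][j] = flip_bit(a_faulty[i][j], bit)
    first = matmul(a_faulty, b)
    second = matmul(a, b)
    return first, first != second


a = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
clean_result, clean_mismatch = redundant_matmul(a, identity)
faulty_result, faulty_mismatch = redundant_matmul(a, identity, inject=(0, 0, 31))
```

Flipping bit 31 inverts the sign, while flipping low mantissa bits perturbs the value only slightly, which is why the chosen fault location matters when studying propagation.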

    Génération dynamique de code pour l'optimisation énergétique

    In computing systems, energy consumption is limiting the performance growth experienced in recent decades. Consequently, computer architecture and software development paradigms will have to change if we want to avoid performance stagnation in the coming decades. In this new scenario, new architectural and micro-architectural designs can increase the energy efficiency of hardware through hardware specialization, such as heterogeneous configurations of cores, new computing units and accelerators. On the other hand, with this new trend, software development must cope with the lack of performance portability across ever-changing hardware and with the increasing gap between the performance that programmers can extract and the maximum achievable performance of the hardware. To address this issue, this thesis contributes a methodology and proof of concept of a run-time auto-tuning framework for embedded systems. The proposed framework can both adapt code to a micro-architecture unknown prior to compilation and explore auto-tuning possibilities that are input-dependent. To study the capability of the proposed approach to adapt code to different micro-architectural configurations, I developed a simulation framework of heterogeneous in-order and out-of-order ARM cores, based on the gem5 and McPAT simulators. Validation experiments demonstrated average absolute timing errors of around 7 % when compared to real ARM Cortex-A8 and A9, and relative energy/performance estimations within 6 % for the Dhrystone 2.1 benchmark when compared to Cortex-A7 and A15 (big.LITTLE) CPUs. The timing validation results show that gem5 is much more accurate than similar existing simulators, whose average errors exceed 15 %. An important component of the run-time auto-tuning framework is a run-time code generation tool called deGoal. It defines a low-level dynamic DSL for computing kernels. During this thesis, I ported deGoal to the ARM Thumb-2 ISA and added new features for run-time auto-tuning. A preliminary validation on ARM processors showed that deGoal can on average generate machine code of equivalent or higher quality compared with programs written in C, including manually vectorized code. The methodology and proof of concept of run-time auto-tuning in embedded processors were developed around two kernel-based applications, extracted from the PARSEC 3.0 suite and its hand-vectorized version PARVEC. In the favorable application, average speedups of 1.26 and 1.38 were obtained on real and simulated cores, respectively, going up to 1.79 and 2.53 (all run-time overheads included). I also demonstrated through simulations that run-time auto-tuning of SIMD instructions for in-order cores can outperform the reference vectorized code run on similar out-of-order cores, with an average speedup of 1.03 and an energy efficiency improvement of 39 %. The unfavorable application was chosen to show that the proposed approach has negligible overheads when better kernel versions cannot be found. When both applications run on real hardware, the run-time auto-tuning performance is on average only 6 % away from the performance obtained by the best statically found kernel implementations.
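
The run-time auto-tuning idea described in the abstract above can be sketched as follows. This is not deGoal and does not use its DSL: it assumes that kernel variants are ordinary Python closures parameterized by an unroll factor, and that the tuner simply times each variant on the actual input and keeps the fastest. All names are hypothetical.

```python
import time


def make_kernel(unroll):
    """Build a dot-product kernel specialized for one unroll factor.

    A stand-in for run-time generated machine code; in Python the unrolling
    has no real performance meaning.
    """
    if unroll < 1:
        raise ValueError("unroll must be >= 1")

    def kernel(xs, ys):
        n = len(xs)
        total = 0.0
        main = n - n % unroll
        for i in range(0, main, unroll):
            for k in range(unroll):
                total += xs[i + k] * ys[i + k]
        for i in range(main, n):  # remainder elements
            total += xs[i] * ys[i]
        return total

    kernel.unroll = unroll
    return kernel


def measure(kernel, xs, ys, repeats=3):
    """Return the best wall-clock time over a few runs of the kernel."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(xs, ys)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(unroll_factors, xs, ys, cost=measure):
    """Return (unroll, kernel) with the lowest cost on the given input."""
    candidates = {u: make_kernel(u) for u in unroll_factors}
    best_unroll = min(candidates, key=lambda u: cost(candidates[u], xs, ys))
    return best_unroll, candidates[best_unroll]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0] * 5
chosen_unroll, chosen_kernel = autotune([1, 2, 4], xs, ys)
```

Because the measurement runs on the real input, the choice is input-dependent. The `cost` parameter lets a different metric (for example an energy estimate) replace wall-clock time.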

    An efficient design space exploration framework to optimize power-efficient heterogeneous many-core multi-threading embedded processor architectures

    By the middle of this decade, uniprocessor architecture performance had hit a roadblock due to a combination of factors, such as excessive power dissipation due to high operating frequencies, growing memory access latencies, diminishing returns on deeper instruction pipelines, and a saturation of available instruction-level parallelism in applications. An attractive and viable alternative embraced by all processor vendors was the multi-core architecture, where throughput is improved by using micro-architectural features such as multiple processor cores, interconnects and low-latency shared caches integrated on a single chip. The individual cores are often simpler than their uniprocessor counterparts, use hardware multi-threading to exploit thread-level parallelism and latency hiding, and typically achieve better performance-power figures. The overwhelming success of multi-core microprocessors in both high-performance and embedded computing platforms motivated chip architects to dramatically scale multi-core processors to many-cores, which will include hundreds of cores on-chip to further improve throughput. With such complex large-scale architectures, however, several key design issues need to be addressed. First, a wide range of micro-architectural parameters, such as L1 caches, load/store queues, shared cache structures and interconnection topologies, together with the non-linear interactions between them, define a vast non-linear multi-variate micro-architectural design space for many-core processors; the traditional method of using extensive in-loop simulation to explore the design space is simply not practical. Second, to accurately evaluate the performance (measured in cycles per instruction, CPI) of a candidate design, the contention at the shared cache must be accounted for in addition to the cycle-by-cycle behavior of the large number of cores, which superlinearly increases the number of simulation cycles per iteration of the design exploration. 
Third, single-thread performance does not scale linearly with the number of hardware threads per core and the number of cores, due to the memory wall effect. This means that at every step of the design process designers must ensure that single-thread performance is not unacceptably slowed down while overall throughput is increased. While all these factors affect design decisions in both high-performance and embedded many-core processors, the design of embedded processors for complex embedded applications such as networking, smart power grids, battlefield decision-making, consumer electronics and biomedical devices, to name a few, is fundamentally different from that of their high-performance counterparts because of the need to consider (i) low power and (ii) real-time operation. This implies that the design objective for embedded many-core processors cannot be simply to maximize performance, but to improve it in such a way that overall power dissipation is minimized and all real-time constraints are met. This necessitates additional power estimation models right at the design stage to accurately measure the cost and reliability of all candidate designs during the exploration phase. In this dissertation, a statistical machine learning (SML) based design exploration framework is presented, which employs an execution-driven cycle-accurate simulator to accurately measure the power and performance of embedded many-core processors. The embedded many-core processor domain considered is Network Processors (NePs), used to process network IP packets. Future-generation NePs, required to operate at terabit-per-second network speeds, capture all the aspects of a complex embedded application: shared data structures, a large volume of compute-intensive and data-intensive real-time bound tasks, and a high level of task (packet) level parallelism. 
Statistical machine learning (SML) is used to efficiently model the performance and power of candidate designs in terms of wide ranges of micro-architectural parameters. The method inherently minimizes the number of in-loop simulations in the exploration framework and also efficiently captures the non-linear interactions between the micro-architectural design parameters. To ensure scalability, the design space is partitioned into (i) core-level micro-architectural parameters, used to optimize single-core architectures subject to the real-time constraints, and (ii) shared-memory-level micro-architectural parameters, used to explore the shared interconnection network and shared cache memory architectures and achieve overall optimality. The cost function of our exploration algorithm is the total power dissipation, which is minimized subject to the real-time throughput constraints (as determined from the terabit optical network router line speed) required in the IP packet processing embedded application.
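
The exploration loop described in the abstract above (simulate a few designs, fit a model, then minimize predicted power subject to a throughput constraint) can be sketched as follows. The abstract does not specify the statistical model, so a nearest-neighbour lookup is swapped in here. The simulator is a made-up analytic toy, and every name and constant is hypothetical.

```python
from itertools import product


def toy_simulator(cores, cache_kb):
    """Hypothetical stand-in for a cycle-accurate simulator.

    Returns (throughput, power) for one design point.
    """
    throughput = cores * 10.0 * (1.0 - 1.0 / (1.0 + cache_kb / 32.0))
    power = cores * 1.5 + cache_kb * 0.05
    return throughput, power


def fit_nearest_neighbour(samples):
    """Return a predictor that answers with the closest simulated design.

    `samples` maps design tuples to (throughput, power). Distances are not
    normalized per parameter, which a real model would need to handle.
    """
    def predict(design):
        nearest = min(samples, key=lambda s: sum((a - b) ** 2 for a, b in zip(s, design)))
        return samples[nearest]

    return predict


def explore(grid, predict, min_throughput):
    """Pick the design with the lowest predicted power that meets the constraint."""
    feasible = [d for d in grid if predict(d)[0] >= min_throughput]
    if not feasible:
        return None
    return min(feasible, key=lambda d: predict(d)[1])


grid = list(product([2, 4, 8], [32, 64, 128]))   # (cores, cache_kb)
simulated = {d: toy_simulator(*d) for d in grid}  # here: every point is simulated
predict = fit_nearest_neighbour(simulated)
best_design = explore(grid, predict, min_throughput=30.0)
```

In practice only a small subset of the grid would be simulated and the model would fill in the rest. Simulating every point, as in this usage, merely makes the sketch easy to check.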

    Improving Compute & Data Efficiency of Flexible Architectures


    Architecting a One-to-many Traffic-Aware and Secure Millimeter-Wave Wireless Network-in-Package Interconnect for Multichip Systems

    With the aggressive scaling of device geometries, the yield of complex Multi Core Single Chip (MCSC) systems with many cores will decrease due to the higher probability of manufacturing defects, especially in dies with a large area. Disintegration of large Systems-on-Chip (SoCs) into smaller chips called chiplets has been shown to improve the yield and cost of complex systems. Therefore, platform-based computing modules such as embedded systems and micro-servers have already adopted Multi Core Multi Chip (MCMC) architectures over MCSC architectures. Due to the scaling of memory-intensive parallel applications in such systems, data is more likely to be shared among cores residing in different chips, resulting in a significant increase in chip-to-chip traffic, especially one-to-many traffic. This one-to-many traffic originates mainly from maintaining cache coherence between many cores residing in multiple chips. In addition, one-to-many traffic is exploited by many parallel programming models, system-level synchronization mechanisms, and control signals. However, state-of-the-art Network-on-Chip (NoC)-based wired interconnection architectures do not provide enough support, as they handle such one-to-many traffic as multiple unicast transmissions using a multi-hop MCMC communication fabric. As a result, even a small portion of such one-to-many traffic can significantly reduce system performance, as traditional NoC-based interconnects cannot mask the high latency and energy consumption caused by chip-to-chip wired I/Os. Moreover, with the increase in memory-intensive applications and the scaling of MCMC systems, traditional NoC-based wired interconnects fail to provide the scalable interconnection solution required to support the increased one-to-many traffic generated by cache coherence and synchronization in future MCMC-based High-Performance Computing (HPC) nodes. 
Therefore, these computation- and memory-intensive MCMC systems need an energy-efficient, low-latency, and scalable one-to-many (broadcast/multicast) traffic-aware interconnection infrastructure to ensure high performance. Research in recent years has shown that Wireless Network-in-Package (WiNiP) architectures with CMOS-compatible Millimeter-Wave (mm-wave) transceivers can provide a scalable, low-latency, and energy-efficient interconnect solution for on- and off-chip communication. In this dissertation, a one-to-many traffic-aware WiNiP interconnection architecture with a starvation-free hybrid Medium Access Control (MAC), an asymmetric topology, and a novel flow control is proposed. The components of the proposed architecture are individually one-to-many traffic-aware and, as a system, collaborate to provide the required support for one-to-many traffic in an MCMC environment. It is shown that such an interconnection architecture can reduce energy consumption and average packet latency by 46.96% and 47.08%, respectively, for MCMC systems. Despite providing performance enhancements, the wireless channel, being an unguided medium, is vulnerable to various security attacks such as jamming-induced Denial-of-Service (DoS), eavesdropping, and spoofing. Further, to minimize time-to-market and design costs, modern SoCs often use Third Party IPs (3PIPs) from untrusted organizations. An adversary at either the foundry or the 3PIP design house can introduce malicious circuitry to jeopardize an SoC. Such malicious circuitry is known as a Hardware Trojan (HT). An HT planted in the WiNiP through a vulnerable design or manufacturing process can compromise a Wireless Interface (WI) to enable illegitimate transmission through the infected WI, resulting in a potential DoS attack on other WIs in the MCMC system. 
Moreover, HTs can be used for various other malicious purposes, including battery exhaustion, functionality subversion, and information leakage. Such information, when leaked to a malicious external attacker, can reveal important details about the application suites running on the system, thereby compromising the user profile. To address persistent jamming-based DoS attacks in WiNiP, this dissertation proposes a secure WiNiP interconnection architecture for MCMC systems that re-uses the one-to-many traffic-aware MAC and existing Design for Testability (DFT) hardware along with a Machine Learning (ML) approach. Furthermore, a novel Simulated Annealing (SA)-based routing obfuscation mechanism is proposed to protect against a novel HT-assisted traffic analysis attack. Simulation results show that the ML classifiers can achieve an accuracy of 99.87% for DoS attack detection, while SA-based routing obfuscation can reduce application detection accuracy to only 15% for the HT-assisted traffic analysis attack, and hence secure the WiNiP fabric against both long-standing and emerging attacks.
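
The cost of handling one-to-many traffic as repeated unicasts, as described in the abstract above, can be made concrete with a small count. The sketch assumes chips laid out on a 2D grid with dimension-ordered wired routing (hop count equals Manhattan distance) and a single shared wireless channel on which one transmission reaches every destination. Both assumptions, and all names, are illustrative and not taken from the dissertation.

```python
def manhattan_hops(src, dst):
    """Wired hop count between two grid positions under dimension-ordered routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def unicast_replication_hops(src, dests):
    """Total link traversals when a one-to-many message is sent as separate unicasts."""
    return sum(manhattan_hops(src, d) for d in dests)


def wireless_broadcast_transmissions(src, dests):
    """Transmissions needed when one wireless broadcast reaches all destinations."""
    return 1 if dests else 0


source = (0, 0)
sharers = [(0, 1), (1, 0), (1, 1)]  # hypothetical chips holding a shared cache line
wired_cost = unicast_replication_hops(source, sharers)
wireless_cost = wireless_broadcast_transmissions(source, sharers)
```

The wired cost grows with both the number of destinations and their distance, while the broadcast count stays constant. This is the scaling argument the abstract makes, independent of the specific latency and energy figures it reports.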