10 research outputs found

    Longevity Framework: Leveraging Online Integrated Aging-Aware Hierarchical Mapping and VF-Selection for Lifetime Reliability Optimization in Manycore Processors

    Rapid device aging in the nano era threatens system lifetime reliability, posing a major intrinsic threat to system functionality. Traditional techniques to overcome aging-induced device slowdown, such as guardbanding, are static and incur performance, power, and area penalties. In a manycore processor, the system-level design abstraction offers dynamic opportunities: controlling task-to-core mappings and per-core operating frequency yields a more balanced core aging profile across the chip, optimizing system lifetime reliability while meeting application performance requirements. This article presents the Longevity Framework (LF), which leverages online, integrated, aging-aware hierarchical mapping and voltage-frequency (VF) selection for lifetime reliability optimization in manycore processors. The mapping exploration is hierarchical to achieve scalability. The VF selection builds on the trade-offs between power, performance, and aging as the VF is scaled, while leveraging per-core DVFS capabilities. The methodology takes chip-wide process variation into account. Extensive experiments comparing the proposed approach with two state-of-the-art methods, for 64-core and 256-core systems running applications from the PARSEC and SPLASH-2 benchmark suites, show an improvement of up to 3.2 years in system lifetime reliability and a 4× improvement in average core health.
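
    The core idea above (steer work toward the healthiest cores, then pick the lowest voltage-frequency level that still meets the deadline) can be illustrated with a toy greedy heuristic. This is a hypothetical sketch, not the LF algorithm: the VF table, the task model, and the aging proxy below are invented for illustration.

        # Toy illustration of aging-aware task mapping with per-core VF selection.
        # NOT the Longevity Framework algorithm: the aging model, VF table and
        # task model are hypothetical placeholders.

        # Available (voltage [V], frequency [GHz]) pairs, lowest-power first.
        VF_LEVELS = [(0.8, 1.0), (0.9, 1.5), (1.0, 2.0)]

        def pick_vf(cycles, deadline_s):
            """Pick the lowest VF level whose frequency still meets the deadline."""
            for volt, freq_ghz in VF_LEVELS:
                if cycles / (freq_ghz * 1e9) <= deadline_s:
                    return volt, freq_ghz
            return VF_LEVELS[-1]          # fall back to the fastest level

        def map_tasks(tasks, core_aging):
            """Greedy mapping: send each task to the least-aged core.

            tasks      -- list of (cycles, deadline_s) tuples
            core_aging -- dict core_id -> accumulated aging metric (arbitrary units)
            """
            mapping = {}
            for tid, (cycles, deadline) in enumerate(tasks):
                core = min(core_aging, key=core_aging.get)   # healthiest core
                volt, freq = pick_vf(cycles, deadline)
                # Crude aging proxy: stress grows with voltage and busy time.
                core_aging[core] += volt ** 2 * cycles / (freq * 1e9)
                mapping[tid] = (core, volt, freq)
            return mapping

        if __name__ == "__main__":
            aging = {core: 0.0 for core in range(4)}
            tasks = [(2e9, 1.5), (1e9, 1.2), (3e9, 2.0)]
            print(map_tasks(tasks, aging))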

    Digital Timing Control in SRAMs for Yield Enhancement and Graceful Aging Degradation

    Embedded SRAMs can occupy the majority of the chip area in SoCs. The increase in process variation and aging degradation due to technology scaling can severely compromise the integrity of SRAM memory cells, resulting in cell failures. Enough cell failures in a memory can cause it to be rejected during initial testing, decreasing manufacturing yield, or, as a result of long-term applied stress, lead to in-field system failures. Certain types of cell failures can be mitigated through improved timing control. Post-fabrication programmable timing allows after-the-fact calibration of timing signals on a per-die basis, so that an SRAM's timing signals are generated from the characteristics of the individual chip, increasing yield and reducing in-field system failures. In this thesis, a delay-line-based SRAM timing block with digitally programmable timing signals has been implemented in a 180 nm CMOS technology. Various timing-related cell failure mechanisms are investigated, including (1) operational read failures, (2) cell stability failures, and (3) power envelope failures. Additionally, the major contributing factors to process variation and device aging degradation are discussed in the context of SRAMs. Simulations show that programmable timing can reduce cell failure rates by over 50%.
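
    A per-die calibration loop for such a digitally programmable timing block might look like the sketch below. The delay-code range, the safety margin, and the bist_pass test hook are hypothetical stand-ins, not the circuit or test flow described in the thesis.

        # Hypothetical per-die calibration of a programmable SRAM timing delay code.
        # The tester hook `bist_pass(code)` stands in for a real built-in self-test
        # that writes/reads test patterns with a given sense-enable delay code.

        def calibrate_delay(bist_pass, code_range=range(0, 32), margin=2):
            """Return the chosen delay code for this die, or None if untestable.

            Scans codes from shortest to longest delay, finds the first code at
            which the BIST passes, then adds a safety margin so that gradual
            aging degradation does not immediately push the die back into failure.
            """
            for code in code_range:
                if bist_pass(code):
                    return min(code + margin, code_range[-1])
            return None   # no working code: reject the die

        if __name__ == "__main__":
            # Fake silicon: this die happens to need a delay code of at least 7.
            fake_die = lambda code: code >= 7
            print(calibrate_delay(fake_die))   # -> 9 (first passing code 7 plus margin 2)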

    Power and Thermal Management of System-on-Chip


    PDE Solvers for Hybrid CPU-GPU Architectures

    Many problems of scientific and industrial interest are investigated by numerically solving partial differential equations (PDEs). For some of these problems, the scope of the investigation is limited by the cost of computational resources. A newer approach to reducing these costs is the use of coprocessors, such as graphics processing units (GPUs) and Many Integrated Core (MIC) cards, which can execute floating-point operations at a higher rate than a central processing unit (CPU) of the same cost. This is achieved through a large number of processors in a single device, each with very limited dedicated memory per thread. Codes for a number of continuum methods, such as boundary element methods (BEM), finite element methods (FEM), and finite difference methods (FDM), have already been implemented on coprocessor architectures. These methods were designed before the adoption of coprocessor architectures, so implementing them efficiently with reduced thread-level memory can be challenging. Other methods do operate efficiently with limited thread-level memory, such as Monte Carlo methods (MCM) and lattice Boltzmann methods (LBM) for kinetic formulations of PDEs, but they are not competitive on CPUs and generally have poorer convergence than the continuum methods. In this work, we introduce a class of methods in which the parallelism of kinetic formulations on GPUs is combined with the better convergence of continuum methods on CPUs. We first extend an existing Feynman-Kac formulation for determining the principal eigenpair of an elliptic operator to create a version that can retrieve arbitrarily many eigenpairs. This new method is implemented for multiple GPUs and combined with a standard deflation preconditioner on multiple CPUs to create a hybrid concurrent method with superior convergence to that of the deflation preconditioner alone. The hybrid method exhibits good parallelism, with an efficiency of 80% on a problem with 300 million unknowns, run on a configuration of 324 CPU cores and 54 GPUs.
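
    The deflation idea that the hybrid method builds on can be written down compactly: given a few (approximate) eigenvectors W of the system matrix A, the solve is split into a small coarse problem on span(W) and a projected CG solve on its complement. The sketch below is a plain single-node NumPy/SciPy illustration of that standard construction (here with simple piecewise-constant deflation vectors on a 1-D Laplacian); it is not the thesis's multi-GPU/CPU implementation.

        # Single-node sketch of deflated CG (Frank/Vuik-style construction),
        # illustrating the role of the deflation preconditioner in the hybrid
        # method. Not the thesis code: W is just a few coarse vectors here.
        import numpy as np
        from scipy.sparse import diags
        from scipy.sparse.linalg import LinearOperator, cg

        n, k = 1000, 10
        A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
        b = np.ones(n)

        # Deflation space: piecewise-constant "subdomain" vectors (a cheap choice).
        W = np.zeros((n, k))
        for j in range(k):
            W[j * (n // k):(j + 1) * (n // k), j] = 1.0

        AW = A @ W                       # n x k
        E = W.T @ AW                     # k x k coarse matrix
        Einv = np.linalg.inv(E)

        def P(v):                        # projector P = I - A W E^(-1) W^T
            return v - AW @ (Einv @ (W.T @ v))

        def PA(v):                       # deflated operator P A
            return P(A @ v)

        y, info = cg(LinearOperator((n, n), matvec=PA), P(b))

        # Recombine: x = W E^(-1) W^T b + (I - W E^(-1) W^T A) y
        x = W @ (Einv @ (W.T @ b)) + y - W @ (Einv @ (AW.T @ y))
        print(info, np.linalg.norm(A @ x - b) / np.linalg.norm(b))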

    Simplified vector-thread architectures for flexible and efficient data-parallel accelerators

    Ph.D. thesis by Christopher Francis Batten, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2010. This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators, including the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. Our recently proposed VT pattern includes many control threads that each manage their own array of microthreads. The control thread uses vector memory instructions to efficiently move data and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability of each microthread to direct its own control flow. In this thesis, I introduce various techniques for building simplified instances of the VT pattern. I propose unifying the VT control-thread and microthread scalar instruction sets to simplify the microarchitecture and programming methodology. I propose a new single-lane VT microarchitecture based on minimal changes to the vector-SIMD pattern. Single-lane cores are simpler to implement than multi-lane cores and can achieve similar energy efficiency. This new microarchitecture uses control-processor embedding to mitigate the area overhead of single-lane cores, and uses vector fragments to handle both regular and irregular DLP more efficiently than previous VT architectures. I also propose an explicitly data-parallel VT programming methodology based on a slightly modified scalar compiler. This methodology is easier to use than assembly programming, yet simpler to implement than an automatically vectorizing compiler. To evaluate these ideas, we have begun implementing the Maven data-parallel accelerator. This thesis compares a simplified Maven VT core to MIMD, vector-SIMD, and SIMT cores. We have implemented these cores with an ASIC methodology, and I use the resulting gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative comparison of the VT pattern to other patterns. My results suggest that future data-parallel accelerators based on simplified VT architectures should be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators.
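
    The VT pattern described above can be caricatured in a few lines: a control thread moves data with vector memory operations and vector-fetches a scalar kernel, while each microthread follows its own control flow. This is a purely illustrative toy model; the function names and "instruction" encoding are invented and bear no relation to the Maven ISA.

        # Toy model of the vector-thread (VT) execution pattern: a control thread
        # issues vector memory operations and "vector-fetches" a scalar kernel
        # that every microthread then runs with its own control flow.

        def vector_load(memory, base, n):
            """Control-thread vector memory op: fetch n consecutive elements."""
            return memory[base:base + n]

        def vector_fetch(kernel, vector_regs):
            """Broadcast a scalar kernel to one microthread per vector element."""
            return [kernel(utid, elem) for utid, elem in enumerate(vector_regs)]

        def kernel(utid, x):
            """Per-microthread code with a data-dependent branch (irregular DLP)."""
            if x % 2 == 0:              # each microthread picks its own path
                return x // 2
            return 3 * x + 1

        if __name__ == "__main__":
            memory = list(range(16))
            vregs = vector_load(memory, base=4, n=8)   # control thread: vector load
            print(vector_fetch(kernel, vregs))         # microthreads: scalar kernel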

    Heterogeneous multicore systems for signal processing

    This thesis explores the capabilities of heterogeneous multi-core systems based on multiple Graphics Processing Units (GPUs) in a standard desktop framework. Multi-GPU accelerated deskside computers are an appealing alternative to other high performance computing (HPC) systems: being composed of commodity hardware components fabricated in large quantities, their price-performance ratio is unparalleled in the world of high performance computing. Essentially bringing “supercomputing to the masses”, this opens up new possibilities for application fields where investing in HPC resources had previously been considered unfeasible. One of these is the field of bioelectrical imaging, a class of medical imaging technologies that occupy a low-cost niche next to million-dollar systems like functional Magnetic Resonance Imaging (fMRI). In the scope of this work, several computational challenges encountered in bioelectrical imaging are tackled with this new kind of computing resource, striving to help these methods approach their true potential. Specifically, the following main contributions were made. Firstly, a novel dual-GPU implementation of parallel triangular matrix inversion (TMI) is presented, addressing a crucial kernel in the computation of multi-mesh head models for electroencephalographic (EEG) source localization. This includes not only a highly efficient implementation of the routine itself, achieving excellent speedups versus an optimized CPU implementation, but also a novel GPU-friendly compressed storage scheme for triangular matrices. Secondly, a scalable multi-GPU solver for non-Hermitian linear systems was implemented. It is integrated into a simulation environment for electrical impedance tomography (EIT) that requires frequent solution of complex systems with millions of unknowns, a task this solution can perform within seconds. In terms of computational throughput, it outperforms not only a highly optimized multi-CPU reference, but related GPU-based work as well. Finally, a GPU-accelerated graphical EEG real-time source localization software was implemented. Thanks to the acceleration, it can meet real-time requirements at unprecedented anatomical detail while running more complex localization algorithms. Additionally, a novel implementation to extract anatomical priors from static Magnetic Resonance (MR) scans has been included.
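
    For context on the compressed triangular storage mentioned above: the conventional packed layout keeps only the n(n+1)/2 non-trivial entries of a triangular matrix in a flat array. The sketch below shows that standard row-major packed indexing as a reference point; the GPU-friendly scheme contributed by the thesis reorders entries for coalesced access and is not reproduced here.

        # Standard row-major packed storage for a lower-triangular n x n matrix:
        # only the n*(n+1)/2 entries with j <= i are kept in a flat array.

        def packed_index(i, j):
            """Flat index of L[i][j] (requires j <= i) in row-major packed storage."""
            assert j <= i, "only the lower triangle is stored"
            return i * (i + 1) // 2 + j

        def pack_lower(L):
            """Pack a dense lower-triangular matrix (list of rows) into a flat list."""
            n = len(L)
            flat = [0.0] * (n * (n + 1) // 2)
            for i in range(n):
                for j in range(i + 1):
                    flat[packed_index(i, j)] = L[i][j]
            return flat

        if __name__ == "__main__":
            L = [[1.0, 0.0, 0.0],
                 [2.0, 3.0, 0.0],
                 [4.0, 5.0, 6.0]]
            print(pack_lower(L))        # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
            print(packed_index(2, 1))   # 4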

    Optimisation énergétique de processus de traitement du signal et ses applications au décodage vidéo

    Consumer electronics today offer more and more features (video, audio, GPS, Internet) and connectivity options (multi-radio systems with WiFi, Bluetooth, UMTS, HSPA, LTE-Advanced, ...). The power demand of these devices is growing for the digital part, especially for the processing chip. To support this ever-increasing computing demand, processor architectures have evolved with multicore processors, graphics processors (GPUs), and other dedicated hardware accelerators. However, the evolution of battery technology is slower, so the autonomy of embedded systems is now under great pressure. Among the new functionalities supported by mobile devices, video services take a prominent place. Indeed, recent analyses show that they will represent 70% of mobile Internet traffic by 2016. Accompanying this growth, new technologies are emerging to enable new services and applications. Among them, HEVC (High Efficiency Video Coding) can double the data compression while maintaining a subjective quality equivalent to its predecessor, the H.264 standard. In a digital circuit, the total power consumption consists of static power and dynamic power. Most modern hardware architectures implement means to control the power consumption of the system. Dynamic Voltage and Frequency Scaling (DVFS) mainly reduces the dynamic power of the circuit; this technique adapts the power of the processor (and therefore its consumption) to the actual load required by the application. To control static power, Dynamic Power Management (DPM, or sleep modes) stops the voltage supplies associated with specific areas of the chip. In this thesis, we first present a model of the energy consumed by the circuit integrating DPM and DVFS modes. This model is generalized to multicore integrated circuits and integrated into a rapid prototyping tool, so that the optimal operating point of a circuit, i.e. the operating frequency and the number of active cores, can be identified. Secondly, the HEVC application is integrated onto a multicore architecture coupled with a sophisticated DVFS mechanism. We show that this application can be implemented efficiently on general-purpose processors (GPPs) while minimizing the power consumption. Finally, to obtain further energy gains, we propose a modified HEVC decoder that can trade decoding quality for additional energy savings according to the available energy budget.
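
    The operating-point search described above (choose the frequency and the number of active cores that minimize energy while the workload still meets its deadline) can be illustrated with a toy power model combining a DVFS-style dynamic term with a DPM-style static term. The coefficients, the frequency table, and the linear-speedup assumption below are placeholders, not the calibrated model of the thesis.

        # Toy search for the (frequency, active-core count) operating point that
        # minimizes energy under a deadline. Dynamic power scales ~ f^3 (DVFS,
        # assuming V ~ f); static power is paid per powered-on core (DPM gates
        # the rest). All numbers are illustrative placeholders.

        FREQS_GHZ = [0.6, 0.8, 1.0, 1.2, 1.5]
        MAX_CORES = 8
        C_DYN  = 0.9     # W per core per GHz^3 (dynamic term)
        P_STAT = 0.25    # W of leakage per powered-on core (static term)

        def best_operating_point(total_cycles, deadline_s):
            best = None
            for cores in range(1, MAX_CORES + 1):
                for f in FREQS_GHZ:
                    t = total_cycles / (cores * f * 1e9)   # assume linear speedup
                    if t > deadline_s:
                        continue                           # misses the deadline
                    power  = cores * (C_DYN * f ** 3 + P_STAT)
                    energy = power * t
                    if best is None or energy < best[0]:
                        best = (energy, cores, f, t)
            return best

        if __name__ == "__main__":
            e, cores, f, t = best_operating_point(total_cycles=12e9, deadline_s=2.0)
            print(f"{cores} cores @ {f} GHz -> {e:.2f} J in {t:.2f} s")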

    Générateur distribué d'horloge pour puces globalement et localement synchrones de grande taille

    This thesis addresses the problem of global synchronization in large systems-on-chip (SoCs). It studies an alternative to conventional clock-tree distribution and to asynchronous communication: a distributed clock-generation technique that enables highly reliable synchronous circuits. The project studies and implements a large network (10x10) of all-digital phase-locked loops (ADPLLs), containing 100 nodes, each generating a clock for its local digital circuitry. The prototype was implemented in silicon, generating clocks in the 903-1161 MHz range, and exhibits a maximum phase error of less than 40 ps between clocks in any two neighboring zones. Another important result is the analysis of the phase error between two distant, non-neighboring oscillators. By studying an FPGA prototype of the network, we found that the maximum steady-state phase error between any clock signal and the reference signal is less than three PFD quantization steps. To validate the synchronization performance in the ASIC, we designed an on-chip clocking-error measurement circuit with a low off-chip readout rate (a few MHz) and a high resolution (±2.5 ps). Reconfigurability is another attractive feature: we explored it and proposed a novel topology with different configurations for nodes on the border and in the core of the network. This topology has the advantage of preventing phase-error propagation and reflection.
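
    The synchronization principle behind such an ADPLL network can be mimicked with a very small behavioral model: each node repeatedly nudges its phase toward the average of its four neighbors, a first-order abstraction of the per-node phase correction loop. The grid size matches the 10x10 network, but the gain, units, and dynamics below are invented; this is not a model of the silicon design.

        # Behavioral toy of a 10x10 grid of coupled clock nodes: each node pulls
        # its phase toward the average of its four neighbors every iteration.
        # Illustrates distributed synchronization only; not the ADPLL circuit.
        import random

        N, K, STEPS = 10, 0.2, 200      # grid size, coupling gain, iterations

        # Initial per-node phase offsets in picoseconds (random process skew).
        phase = [[random.uniform(-50.0, 50.0) for _ in range(N)] for _ in range(N)]

        def neighbors(i, j):
            return [(i + di, j + dj)
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= i + di < N and 0 <= j + dj < N]

        def max_neighbor_skew(ph):
            return max(abs(ph[i][j] - ph[a][b])
                       for i in range(N) for j in range(N)
                       for a, b in neighbors(i, j))

        for _ in range(STEPS):
            nxt = [row[:] for row in phase]
            for i in range(N):
                for j in range(N):
                    nbs = neighbors(i, j)
                    avg = sum(phase[a][b] for a, b in nbs) / len(nbs)
                    nxt[i][j] += K * (avg - phase[i][j])   # local correction
            phase = nxt

        print(f"max neighbor-to-neighbor skew after {STEPS} steps: "
              f"{max_neighbor_skew(phase):.3f} ps")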