10 research outputs found

    Hardware/Software Co-design for Multicore Architectures

    Get PDF
    Siirretty Doriast

    Static task mapping for tiled chip multiprocessors with multiple voltage islands

    Get PDF
    The complexity of large Chip Multiprocessors (CMP) makes design reuse a practical approach to reduce the manufacturing and design cost of high-performance systems. This paper proposes techniques for static task mapping onto general-purpose CMPs with multiple pre-defined voltage islands for power management. The CMPs are assumed to contain different classes of processing elements with multiple voltage/frequency execution modes to better cover a large range of applications. Task mapping is performed with awareness of both on-chip and off-chip memory traffic, and communication constraints such as the link and memory bandwidth. Besides proposing a linear programming model for small systems, a novel mapping approach based on Extremal Optimization is proposed for large-scale CMPs. This new combinatorial optimization method has delivered very good results in quality and computational cost when compared to the classical simulated annealing.Postprint (published version

    Reliability in the face of variability in nanometer embedded memories

    Get PDF
    In this thesis, we have investigated the impact of parametric variations on the behaviour of one performance-critical processor structure - embedded memories. As variations manifest as a spread in power and performance, as a first step, we propose a novel modeling methodology that helps evaluate the impact of circuit-level optimizations on architecture-level design choices. Choices made at the design-stage ensure conflicting requirements from higher-levels are decoupled. We then complement such design-time optimizations with a runtime mechanism that takes advantage of adaptive body-biasing to lower power whilst improving performance in the presence of variability. Our proposal uses a novel fully-digital variation tracking hardware using embedded DRAM (eDRAM) cells to monitor run-time changes in cache latency and leakage. A special fine-grain body-bias generator uses the measurements to generate an optimal body-bias that is needed to meet the required yield targets. A novel variation-tolerant and soft-error hardened eDRAM cell is also proposed as an alternate candidate for replacing existing SRAM-based designs in latency critical memory structures. In the ultra low-power domain where reliable operation is limited by the minimum voltage of operation (Vddmin), we analyse the impact of failures on cache functional margin and functional yield. Towards this end, we have developed a fully automated tool (INFORMER) capable of estimating memory-wide metrics such as power, performance and yield accurately and rapidly. Using the developed tool, we then evaluate the #effectiveness of a new class of hybrid techniques in improving cache yield through failure prevention and correction. Having a holistic perspective of memory-wide metrics helps us arrive at design-choices optimized simultaneously for multiple metrics needed for maintaining lifetime requirements

    Optimisation de dispositifs FDSOI pour la gestion de la consommation et de la vitesse (application aux mémoires et fonctions logiques)

    Get PDF
    Avec la percée des téléphones portables et des tablettes numériques intégrant des fonctions avancées de traitement de l'information, une croissance exponentielle du marché des systèmes sur puce (SoC pour System On Chip en anglais) est attendue jusqu'en 2016. Ces systèmes, conçus dans les dernières technologies nanométriques, nécessitent des vitesses de fonctionnement très élevées pour offrir des performances incroyables, tout en consommant remarquablement peu. Cependant, concevoir de tels systèmes à l'échelle nanométrique présente de nombreux enjeux en raison de l'accentuation d'effets parasites avec la miniaturisation des transistors MOS sur silicium massif, rendant les circuits plus sensibles aux phénomènes de fluctuations des procédés de fabrication et moins efficaces énergétiquement. La technologie planaire complètement désertée (FD pour Fully depleted en anglais) SOI, offrant un meilleur contrôle du canal du transistor et une faible variabilité de sa tension de seuil grâce à un film de silicium mince et non dopé, apparaît comme une solution technologique très bien adaptée pour répondre aux besoins de ces dispositifs nomades alliant hautes performances et basse consommation. Cependant pour que cette technologie soit viable, il est impératif qu'elle réponde aux besoins des plateformes de conception basse consommation. Un des défis majeurs de l'état de l'art de la technologie planaire FDSOI est de fournir les différentes tensions de seuils (VT) requises pour la gestion de la consommation et de la vitesse. Le travail de recherche de thèse présenté dans ce mémoire a contribué à la mise en place d'une plateforme de conception multi-VT en technologie planaire FDSOI sur oxyde enterré mince (UTB pour Ultra Thin Buried oxide en anglais) pour les nœuds technologiques sub-32 nm. Pour cela, les éléments clefs des plateformes de conception basse consommation en technologie planaire sur silicium massif ont été identifiés. A la suite de cette analyse, différentes architectures de transistors MOS multi-VT FDSOI ont été développées. L'analyse au niveau des circuits numériques et mémoires élémentaires a permis de mettre en avant deux solutions fiables, efficaces et de faible complexité technologique. Les performances des solutions apportées ont été évaluées sur un chemin critique extrait du cœur de processeur ARM Cortex A9 et sur une cellule SRAM 6T haute densité (0,120 m ). Egalement, une cellule SRAM à quatre transistors est proposée, démontrant la flexibilité au niveau conception des solutions proposées. Ce travail de recherche a donné lieu à de nombreuses publications, communications et brevets. Aujourd'hui, la majorité des résultats obtenus ont été transférés chez STMicroelectronics, où l'étude de leur industrialisation est en cours.Driven by the strong growth of smartphone and tablet devices, an exponential growth for the mobile SoC market is forecasted up to 2016. These systems, designed in the latest nanometre technology, require very high speeds to deliver tremendous performances, while consuming remarkably little. However, designing such systems at the nanometre scale introduces many challenges due to the emphasis of parasitic phenomenon effects driven by the scaling of bulk MOSFETs, making circuits more sensitive to the manufacturing process fluctuations and less energy efficient. Undoped thin-film planar fully depleted silicon-on-insulator (FDSOI) devices are being investigated as an alternative to bulk devices in 28nm node and beyond, thanks to its excellent short-channel electrostatic control, low leakage currents and immunity to random dopant fluctuation. This compelling technology appears to meet the needs of nomadic devices, combining high performance and low power consumption. However, to be useful, it is essential that this technology is compatible with low operating power design platforms. A major challenge for this technology is to provide various device threshold voltages (VT), trading off power consumption and speed. The research work presented in this thesis has contributed to the development of a multi-VT design platform in FDSOI planar technology on thin buried oxide (UTB) for the 28nm and below technology nodes. In this framework, the key elements of the low power design platform in bulk planar technology have been studied. Based on this analysis, different architectures of FDSOI multi-VT MOSFETs have been developed. The analysis on the layout of elementary circuits, such as standard cells and SRAM cells, has put forward two reliable, efficient and low technological complexity multi- strategies. Finally, the performances of these solutions have been evaluated on a critical path extracted from the ARM Cortex A9 processor and a high-density 6T SRAM cell (0.120 m ). Also, an SRAM cell with four transistors has been proposed, highlighting the design flexibility brought by these solutions. This thesis has resulted in many publications, communications and patents. Today, the majority of the results obtained have been transferred to STMicroelectronics, where the industrialization is in progress.SAVOIE-SCD - Bib.électronique (730659901) / SudocGRENOBLE1/INP-Bib.électronique (384210012) / SudocGRENOBLE2/3-Bib.électronique (384219901) / SudocSudocFranceF

    Performance Analysis of Complex Shared Memory Systems

    Get PDF
    Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems that arise from using multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance. The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to which extend the application performance is limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy as well as waiting times caused by memory accesses is developed in the second step. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which can be used to identify the events that provide a reasonable assessment for the utilization of the respective component and the amount of time that is spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory related performance issues in existing performance analysis tools. The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node leads to complex systems with vastly different performance characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory related performance limitations

    A Unified Infrastructure for Monitoring and Tuning the Energy Efficiency of HPC Applications

    Get PDF
    High Performance Computing (HPC) has become an indispensable tool for the scientific community to perform simulations on models whose complexity would exceed the limits of a standard computer. An unfortunate trend concerning HPC systems is that their power consumption under high-demanding workloads increases. To counter this trend, hardware vendors have implemented power saving mechanisms in recent years, which has increased the variability in power demands of single nodes. These capabilities provide an opportunity to increase the energy efficiency of HPC applications. To utilize these hardware power saving mechanisms efficiently, their overhead must be analyzed. Furthermore, applications have to be examined for performance and energy efficiency issues, which can give hints for optimizations. This requires an infrastructure that is able to capture both, performance and power consumption information concurrently. The mechanisms that such an infrastructure would inherently support could further be used to implement a tool that is able to do both, measuring and tuning of energy efficiency. This thesis targets all steps in this process by making the following contributions: First, I provide a broad overview on different related fields. I list common performance measurement tools, power measurement infrastructures, hardware power saving capabilities, and tuning tools. Second, I lay out a model that can be used to define and describe energy efficiency tuning on program region scale. This model includes hardware and software dependent parameters. Hardware parameters include the runtime overhead and delay for switching power saving mechanisms as well as a contemplation of their scopes and the possible influence on application performance. Thus, in a third step, I present methods to evaluate common power saving mechanisms and list findings for different x86 processors. Software parameters include their performance and power consumption characteristics as well as the influence of power-saving mechanisms on these. To capture software parameters, an infrastructure for measuring performance and power consumption is necessary. With minor additions, the same infrastructure can later be used to tune software and hardware parameters. Thus, I lay out the structure for such an infrastructure and describe common components that are required for measuring and tuning. Based on that, I implement adequate interfaces that extend the functionality of contemporary performance measurement tools. Furthermore, I use these interfaces to conflate performance and power measurements and further process the gathered information for tuning. I conclude this work by demonstrating that the infrastructure can be used to manipulate power-saving mechanisms of contemporary x86 processors and increase the energy efficiency of HPC applications

    Adaptive Distributed Architectures for Future Semiconductor Technologies.

    Full text link
    Year after year semiconductor manufacturing has been able to integrate more components in a single computer chip. These improvements have been possible through systematic shrinking in the size of its basic computational element, the transistor. This trend has allowed computers to progressively become faster, more efficient and less expensive. As this trend continues, experts foresee that current computer designs will face new challenges, in utilizing the minuscule devices made available by future semiconductor technologies. Today's microprocessor designs are not fit to overcome these challenges, since they are constrained by their inability to handle component failures by their lack of adaptability to a wide range of custom modules optimized for specific applications and by their limited design modularity. The focus of this thesis is to develop original computer architectures, that can not only survive these new challenges, but also leverage the vast number of transistors available to unlock better performance and efficiency. The work explores and evaluates new software and hardware techniques to enable the development of novel adaptive and modular computer designs. The thesis first explores an infrastructure to quantitatively assess the fallacies of current systems and their inadequacy to operate on unreliable silicon. In light of these findings, specific solutions are then proposed to strengthen digital system architectures, both through hardware and software techniques. The thesis culminates with the proposal of a radically new architecture design that can fully adapt dynamically to operate on the hardware resources available on chip, however limited or abundant those may be.PHDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/102405/1/apellegr_1.pd
    corecore