Power aware early design stage hardware software co-optimization
Co-optimizing hardware and software can lead to substantial performance and energy benefits, and is becoming an increasingly important design paradigm. In scientific computing, power constraints increasingly necessitate the return to specialized chips such as Intel’s MIC or IBM’s Blue Gene architectures. To enable hardware/software co-design in early stages of the design cycle, we propose a simulation infrastructure methodology that combines high-abstraction performance simulation using Sniper with power modeling using McPAT and custom DRAM power models. Sniper/McPAT is fast (simulation speed is around 2 MIPS on an 8-core host machine) because it uses analytical modeling to abstract away core performance during multi-core simulation. We demonstrate Sniper/McPAT’s accuracy through validation against real hardware; we report average performance and power prediction errors of 22.1% and 8.3%, respectively, for a set of SPEComp benchmarks.
Using fast and accurate simulation to explore hardware/software trade-offs in the multi-core era
Writing well-performing parallel programs is challenging in the multi-core processor era. In addition to achieving good per-thread performance, which in itself is a balancing act between instruction-level parallelism, pipeline effects and good memory performance, multi-threaded programs complicate matters even further. These programs require synchronization, and are affected by the interactions between threads through sharing of both processor resources and the cache hierarchy.
At the Intel Exascience Lab, we are developing an architectural simulator called Sniper for simulating future exascale-era multi-core processors. Its goal is twofold: Sniper should assist hardware designers in making design decisions, while simultaneously providing software designers with a tool to gain insight into the behavior of their algorithms and to optimize them. By taking architectural features into account, our simulator can provide more insight into parallel programs than can be obtained from existing performance analysis tools. This unique combination of hardware simulator and software performance analysis tool makes Sniper a useful tool for simultaneous exploration of the hardware and software design space for future high-performance multi-core systems.
Multiple clock domain synchronization for network on chips
This thesis provides a new framework for the design of very high performance yet low power System on Chips (SoCs). Network on chip (NoC) is emerging as a revolutionary methodology to integrate numerous Intellectual Property (IP) blocks in a single System-on-Chip (SoC) and to solve the performance limitations arising from long interconnects. Continued advancement of NoC designs is heavily dependent on the ability to communicate effectively among the constituent IP blocks and embedded cores, as well as to manage and reduce energy dissipation. This work first presents a low-latency, low-energy synchronization mechanism for Network on Chip architectures, which enables the network to span an SoC with multiple independent clock domains. The proposed interface scheme has been compared to another existing scheme and shown to outperform it in terms of latency and energy dissipation. The synchronizers were introduced in the communication fabric for seamless integration of the different IP blocks. As communication happens across clock domains, the clock distribution scheme over the entire network was redesigned for greater savings in power. It is shown that communication energy can be optimized by selecting an appropriate number of different clock regions and their relative placement. It is demonstrated that in a mesh-based NoC the communication energy initially decreases with an increasing number of clock domains, but beyond a certain threshold it shows an increasing trend due to synchronization overhead.
NETWORK ON CHIP BASED HARDWARE ACCELERATORS FOR COMPUTATIONAL BIOLOGY
As the clock frequency of systems is no longer scaling, the computer architecture community is exploring different strategies to continue application scaling and is also looking for novel applications that might benefit from this research. One such direction is the multi-core approach: the application problem is partitioned into smaller sub-problems, which are solved using a divide-and-conquer approach. Network on Chip offers a promising methodology to integrate a large number of cores onto a single chip and to efficiently manage the communication among them. The demand for high-throughput, low-power and low-latency interconnection is pushing the adoption of this scheme even further. Modern scientific computing offers several challenging problems for the computer architecture community to work on. Computational biology is one such domain, where the problems of interest are data intensive, compute intensive, and communication intensive in varying combinations, and one size does not fit all applications. Biocomputing will therefore need architecture and resources that map to the diverse hardware portfolio. In this work, the complete design and performance evaluation has been carried out for two such biocomputing applications, namely sequence alignment and phylogenetic reconstruction. The major challenge in both problems arises from the limitation of the available on-chip memory, which bounds the scalability of the problem size. It has been demonstrated that significant speedup can be achieved for problems of manageable size, with much less power dissipation compared to the currently available solutions.
Rh(II)-Catalysed N2-Selective Arylation of Benzotriazoles and Indazoles using Quinoid Carbenes via 1,5-H Shift
A Rh(II)-catalyzed, highly selective N2-arylation of benzotriazole is developed with wide scope and good functional group tolerance. The reaction is also extended to indazole and substituted 1,2,3-triazole scaffolds. In addition, late-stage arylation of benzotriazoles tethered with bioactive molecules is realized under the developed conditions. Control experiments and DFT calculations reveal that the reaction presumably proceeds via nucleophilic addition of the N2 center (of the 1H tautomer) to the metal-carbene, followed by a 1,5-H shift. This differs from classical X-H insertion into carbene centers and subsequent 1,2-H shift.
Power-Aware Multi-Core Simulation for Early Design Stage Hardware/Software Co-Optimization
Stringent performance targets and power constraints push designers towards building specialized workload-optimized systems across a broad spectrum of the computing arena, including supercomputing applications as exemplified by the IBM BlueGene and Intel MIC architectures. In this paper, we make the case for hardware/software co-design during early design stages of processors for scientific computing applications. Considering an important scientific kernel, namely stencil computation, we demonstrate that performance and energy-efficiency can be improved by a factor of 1.66× and 1.25×, respectively, by co-optimizing hardware and software. To enable hardware/software co-design in early stages of the design cycle, we propose a novel simulation infrastructure by combining high-abstraction performance simulation using Sniper with power modeling using McPAT and custom DRAM power models. Sniper/McPAT is fast (simulation speed is around 2 MIPS on an 8-core host machine) because it uses analytical modeling to abstract away core performance during multi-core simulation. We demonstrate Sniper/McPAT’s accuracy through validation against real hardware; we report average performance and power prediction errors of 22.1% and 8.3%, respectively, for a set of SPEComp benchmarks.