8 research outputs found
A distributed interleaving scheme for efficient access to WideIO DRAM memory
Achieving the main memory (DRAM) required bandwidth at acceptable
power levels for current and future applications is a major
challenge for System-on-Chip designers for mobile platforms.
Three dimensional (3D) integration and 3D stacked DRAM memories
promise to provide a significant boost in bandwidth at low
power levels by exploiting multiple channels and wide data interfaces.
In this paper, we address the problem of efficiently exploiting
the multiple channels provided by standard (JEDEC’s WIDEIO)
3D-stacked memories, to extract maximal effective bandwidth
and minimize latency for main memory access. We propose a new
distributed interleaved access method that leverages the on-chip interconnect
to simplify the design and implementation of the DRAM
controller, without impacting performance compared to traditional
centralized implementations. We perform experiments on realistic
workload for a mobile communication and multimedia platform
and show that our proposed distributed interleaving memory access
method improves the overall throughput while minimally impacting
the performance of latency sensitive communication flows
Design Methods and Tools for Application-Specific Predictable Networks-on-Chip
As the complexity of applications grows with each new generation, so does the demand for computation power. To satisfy the computation demands at manageable power levels, we see a shift in the design paradigm from single processor systems to Multiprocessor Systems-on-Chip (MPSoCs). MPSoCs leverage the parallelism in applications to increase the performance at the same power levels. To further improve the computation to power consumption ratio, MPSoCs for embedded applications are heterogeneous and integrate cores that are specialized to perform the different functionalities of the application. With technology scaling, wire power consumption is increasing compared to logic, making communication as expensive as computation. Therefore customizing the interconnect is necessary to achieve energy efficiency. Designing an optimal application specific Network-on-Chip (NoC), that meets application demands, requires the exploration of a large design space. Automatic design and optimization of the NoC is required in order to achieve fast design closure, especially for heterogeneous MPSoCs. To continue to meet the computation requirements of future applications new technologies are emerging. Three dimensional integration promises to increase the number of transistors by stacking multiple silicon layers. This will lead to an increase in the number of cores of the MPSoCs resulting in increased communication demands. To compensate for the increase in the wire delay in new technology nodes as well as to reduce the power consumption further, multi-synchronous design is becoming popular. With multiple clock signals, different parts of the MPSoC can be clocked at different frequencies according to the current demands of the application and can even be shutdown when they are not used at all. This further complicates the design of the NoC.Many applications require different levels of guarantee from the NoC in order to perform their functionality correctly. As communication traffic patterns become more complex, the performance of the NoC can no longer be predicted statically. Therefore designing the interconnect network requires that such guarantees are provided during the dynamic operation of the system which includes the interaction with major subsystems (i.e., main memory) and not just the interconnect itself. In this thesis, I present novel methods to design application-specific NoCs that meet performance demands, under the constraints of new technologies. To provide different levels of Quality of Service, I integrate methods to estimate the NoC performance during the design phase of the interconnect topology. I present methods and architectures for NoCs to efficiently access memory systems, in order to achieve predictable operation of the systems from the point of view of the communication as well as the bottleneck target devices. Therefore the main contribution of the thesis is twofold: scientific as I propose new algorithms to perform topology synthesis and engineering by presenting extensive experiments and architectures for NoC design
Balancing reliability, cost, and performance tradeoffs with FreeFault
Abstract—Memory errors have been a major source of system failures and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mecha-nisms or significantly increase their overhead. In this paper, we present FreeFault as a hardware-only, transparent, and nearly-free resilience mechanism that is implemented entirely within a processor and can tolerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping type repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer in an overall resilience scheme of highly-reliable and highly-available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability. I
Recommended from our members
Performance-efficient mechanisms for managing irregularity in throughput processors
textRecent graphics processing units (GPUs) have emerged as a promising platform for general purpose computing and have been shown to be very efficient in executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model that balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads and each thread is supported with conditional branch and fine-grained load/store instruction for ease of programming. At the same time, the hardware and software collaboratively enable the grouping of scalar threads to be executed in a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs gain momentum in being utilized in various application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and waste in off-chip memory bandwidth utilization for highly irregular programs with divergent control and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures such that they can better manage irregularity. I first identify that previously suggested optimizations to the divergent control flow problem suffers from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular/irregular codes, and 3) limited applicability. Based on such observations, I propose and evaluate three novel mechanisms that resolve the aforementioned issues, providing significant performance improvements while minimizing implementation overhead. In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste in off-chip memory bandwidth utilization. I design and evaluate a locality-aware memory hierarchy for throughput processors, which retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality.Electrical and Computer Engineerin
A Comprehensive Study of DRAM Controllers in Real-Time Systems
The DRAM main memory is a critical component and a performance bottleneck of almost all computing systems. Since the DRAM is a shared memory resource on multi-core plat- forms, all cores contend for the memory bandwidth. Therefore, there is a keen interest in the real-time community to design predictable DRAM controllers to provide a low memory access latency bound to meet the strict timing requirement of real-time applications.
Due to the lack of generalization of publicly available DRAM controller models in full-system and DRAM device simulators, researchers often design in-house simulator to validate their designs. An extensible cycle-accurate DRAM controller simulation frame- work is developed to simplify the process of validating new DRAM controller designs. To prove the extensibility and reusability of the framework, ten state-of-the-art predictable DRAM controllers are implemented in the framework with less than 200 lines of new code.
With the help of the framework, a comprehensive evaluation of state-of-the-art pre- dictable DRAM controllers is performed analytically and experimentally to show the im- pact of different system parameters. This extensive evaluation allows researchers to assess the contribution of state-of-the-art DRAM controller approaches.
At last, a novel DRAM controller with request reordering technique is proposed to provide a configurable trade-off between latency bound and bandwidth in mixed-critical systems. Compared to the state-of-the-art DRAM controller, there is a balance point between the two designs which depends on the locality of the task under analysis and the DRAM device used in the system
Circuit Techniques for Adaptive and Reliable High Performance Computing.
Increasing power density with process scaling has caused stagnation in the clock speed of modern microprocessors. Accordingly, designers have adopted message passing and shared memory based multicore architectures in order to keep up with the rapidly rising demand for computing throughput. At the same time, applications are not entirely parallel and improving single-thread performance continues to remain critical. Additionally, reliability is also worsening with process scaling, and margining for failures due to process and environmental variations in modern technologies consumes an increasingly large portion of the power/performance envelope. In the wake of multicore computing, reliability of signal synchronization between the cores is also becoming increasingly critical. This forces designers to search for alternate efficient methods to improve compute performance while addressing reliability. Accordingly, this dissertation presents innovative circuit and architectural techniques for variation-tolerance, performance and reliability targeted at datapath logic, signal synchronization and memories.
Firstly, a domino logic based design style for datapath logic is presented that uses Adaptive Robustness Tuning (ART) in addition to timing speculation to provide up to 71% performance gains over conventional domino logic in 32bx32b multiplier in 65nm CMOS. Margins are reduced until functionality errors are detected, that are used to guide the tuning.
Secondly, for signal synchronization across clock domains, a new class of dynamic logic based synchronizers with single-cycle synchronization latency is presented, where pulses, rather than stable intermediate voltages cause metastability. Such pulses are amplified using skewed inverters to improve mean time between failures by ~1e6x over jamb latches and double flip-flops at 2GHz in 65nm CMOS.
Thirdly, a reconfigurable sensing scheme for 6T SRAMs is presented that employs auto-zero calibration and pre-amplification to improve sensing reliability (by up to 1.2 standard deviations of NMOS threshold voltage in 28nm CMOS); this increased reliability is in turn traded for ~42% sensing speedup.
Finally, a main memory architecture design methodology to address reliability and power in the context of Exascale computing systems is presented. Based on 3D-stacked DRAMs, the methodology co-optimizes DRAM access energy, refresh power and the increased cost of error resilience, to meet stringent power and reliability constraints.PhDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/107238/1/bharan_1.pd
Neural networks-on-chip for hybrid bio-electronic systems
PhD ThesisBy modelling the brains computation we can further our understanding
of its function and develop novel treatments for neurological disorders. The
brain is incredibly powerful and energy e cient, but its computation does
not t well with the traditional computer architecture developed over the
previous 70 years. Therefore, there is growing research focus in developing
alternative computing technologies to enhance our neural modelling capability,
with the expectation that the technology in itself will also bene t from
increased awareness of neural computational paradigms.
This thesis focuses upon developing a methodology to study the design
of neural computing systems, with an emphasis on studying systems suitable
for biomedical experiments. The methodology allows for the design to be
optimized according to the application. For example, di erent case studies
highlight how to reduce energy consumption, reduce silicon area, or to
increase network throughput.
High performance processing cores are presented for both Hodgkin-Huxley
and Izhikevich neurons incorporating novel design features. Further, a complete
energy/area model for a neural-network-on-chip is derived, which is
used in two exemplar case-studies: a cortical neural circuit to benchmark
typical system performance, illustrating how a 65,000 neuron network could
be processed in real-time within a 100mW power budget; and a scalable highperformance
processing platform for a cerebellar neural prosthesis. From
these case-studies, the contribution of network granularity towards optimal
neural-network-on-chip performance is explored