
8 research outputs found

A distributed interleaving scheme for efficient access to WideIO DRAM memory

Author: C. Seiculescu
G. De Micheli
L. Benini
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2012

Achieving the required main memory (DRAM) bandwidth at acceptable power levels for current and future applications is a major challenge for System-on-Chip designers for mobile platforms. Three-dimensional (3D) integration and 3D-stacked DRAM memories promise to provide a significant boost in bandwidth at low power levels by exploiting multiple channels and wide data interfaces. In this paper, we address the problem of efficiently exploiting the multiple channels provided by standard (JEDEC’s WideIO) 3D-stacked memories to extract maximal effective bandwidth and minimize latency for main memory access. We propose a new distributed interleaved access method that leverages the on-chip interconnect to simplify the design and implementation of the DRAM controller, without impacting performance compared to traditional centralized implementations. We perform experiments on a realistic workload for a mobile communication and multimedia platform and show that our proposed distributed interleaving memory access method improves the overall throughput while minimally impacting the performance of latency-sensitive communication flows.

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Scalable and bandwidth-efficient memory subsystem design for real-time systems

Author: Gomony M.D.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2015

Repository TU/e

Pure OAI Repository

Design Methods and Tools for Application-Specific Predictable Networks-on-Chip

Author: Seiculescu Ciprian
Publication venue: Lausanne, EPFL
Publication date: 23/07/2012

As the complexity of applications grows with each new generation, so does the demand for computation power. To satisfy the computation demands at manageable power levels, we see a shift in the design paradigm from single-processor systems to Multiprocessor Systems-on-Chip (MPSoCs). MPSoCs leverage the parallelism in applications to increase performance at the same power levels. To further improve the ratio of computation to power consumption, MPSoCs for embedded applications are heterogeneous and integrate cores that are specialized to perform the different functionalities of the application. With technology scaling, wire power consumption is increasing compared to logic, making communication as expensive as computation. Therefore, customizing the interconnect is necessary to achieve energy efficiency. Designing an optimal application-specific Network-on-Chip (NoC) that meets application demands requires the exploration of a large design space. Automatic design and optimization of the NoC is required in order to achieve fast design closure, especially for heterogeneous MPSoCs. To continue to meet the computation requirements of future applications, new technologies are emerging. Three-dimensional integration promises to increase the number of transistors by stacking multiple silicon layers. This will lead to an increase in the number of cores in MPSoCs, resulting in increased communication demands. To compensate for the increase in wire delay in new technology nodes, as well as to reduce power consumption further, multi-synchronous design is becoming popular. With multiple clock signals, different parts of the MPSoC can be clocked at different frequencies according to the current demands of the application and can even be shut down when they are not used at all. This further complicates the design of the NoC. Many applications require different levels of guarantee from the NoC in order to perform their functionality correctly.
As communication traffic patterns become more complex, the performance of the NoC can no longer be predicted statically. Therefore, designing the interconnect network requires that such guarantees are provided during the dynamic operation of the system, which includes the interaction with major subsystems (i.e., main memory) and not just the interconnect itself. In this thesis, I present novel methods to design application-specific NoCs that meet performance demands under the constraints of new technologies. To provide different levels of Quality of Service, I integrate methods to estimate the NoC performance during the design phase of the interconnect topology. I present methods and architectures for NoCs to efficiently access memory systems, in order to achieve predictable operation of the systems from the point of view of the communication as well as the bottleneck target devices. The main contribution of the thesis is therefore twofold: scientific, as I propose new algorithms to perform topology synthesis, and engineering, as I present extensive experiments and architectures for NoC design.

Infoscience - École polytechnique fédérale de Lausanne

Balancing reliability, cost, and performance tradeoffs with FreeFault

Author: Dong Wan Kim
Mattan Erez
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 04/12/2015

Memory errors have been a major source of system failures, and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with the advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this paper, we present FreeFault as a hardware-only, transparent, and nearly-free resilience mechanism that is implemented entirely within a processor and can tolerate the majority of DRAM faults. FreeFault repurposes portions of the last-level cache for storing retired memory regions and augments a hardware memory scrubber to monitor memory health and aid retirement decisions. Because it relies on existing structures (cache associativity) for retirement/remapping type repair, FreeFault has essentially no hardware overhead. Because it requires a very modest portion of the cache (as small as 8KB) to cover a large fraction of DRAM faults, FreeFault has almost no impact on performance. We explain how FreeFault adds an attractive layer in an overall resilience scheme of highly-reliable and highly-available systems by delaying, and even entirely avoiding, calling upon software to make tradeoff decisions between memory capacity, performance, and reliability.

CiteSeerX

Crossref


Performance-efficient mechanisms for managing irregularity in throughput processors

Author: Rhu Minsoo
Publication date: 01/07/2014

Recent graphics processing units (GPUs) have emerged as a promising platform for general purpose computing and have been shown to be very efficient in executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model that balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads, and each thread is supported with conditional branch and fine-grained load/store instructions for ease of programming. At the same time, the hardware and software collaboratively enable the grouping of scalar threads to be executed in a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs gain momentum in being utilized in various application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and waste in off-chip memory bandwidth utilization for highly irregular programs with divergent control and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures such that they can better manage irregularity. I first identify that previously suggested optimizations to the divergent control flow problem suffer from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular/irregular codes, and 3) limited applicability. Based on such observations, I propose and evaluate three novel mechanisms that resolve the aforementioned issues, providing significant performance improvements while minimizing implementation overhead.
In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste in off-chip memory bandwidth utilization. I design and evaluate a locality-aware memory hierarchy for throughput processors, which retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality.

Texas ScholarWorks

A Comprehensive Study of DRAM Controllers in Real-Time Systems

Author: Guo Danlu
Publication venue: 'University of Waterloo'
Publication date: 22/12/2016

The DRAM main memory is a critical component and a performance bottleneck of almost all computing systems. Since the DRAM is a shared memory resource on multi-core platforms, all cores contend for the memory bandwidth. Therefore, there is a keen interest in the real-time community to design predictable DRAM controllers that provide a low memory access latency bound to meet the strict timing requirements of real-time applications. Due to the lack of generalization of publicly available DRAM controller models in full-system and DRAM device simulators, researchers often design in-house simulators to validate their designs. An extensible cycle-accurate DRAM controller simulation framework is developed to simplify the process of validating new DRAM controller designs. To prove the extensibility and reusability of the framework, ten state-of-the-art predictable DRAM controllers are implemented in the framework with less than 200 lines of new code. With the help of the framework, a comprehensive evaluation of state-of-the-art predictable DRAM controllers is performed analytically and experimentally to show the impact of different system parameters. This extensive evaluation allows researchers to assess the contribution of state-of-the-art DRAM controller approaches. Finally, a novel DRAM controller with a request reordering technique is proposed to provide a configurable trade-off between latency bound and bandwidth in mixed-critical systems. Compared to the state-of-the-art DRAM controller, there is a balance point between the two designs which depends on the locality of the task under analysis and the DRAM device used in the system.

University of Waterloo's Institutional Repository

Circuit Techniques for Adaptive and Reliable High Performance Computing.

Author: Giridhar Bharan

Increasing power density with process scaling has caused stagnation in the clock speed of modern microprocessors. Accordingly, designers have adopted message passing and shared memory based multicore architectures in order to keep up with the rapidly rising demand for computing throughput. At the same time, applications are not entirely parallel, and improving single-thread performance remains critical. Additionally, reliability is worsening with process scaling, and margining for failures due to process and environmental variations in modern technologies consumes an increasingly large portion of the power/performance envelope. In the wake of multicore computing, reliability of signal synchronization between the cores is also becoming increasingly critical. This forces designers to search for alternate efficient methods to improve compute performance while addressing reliability. Accordingly, this dissertation presents innovative circuit and architectural techniques for variation-tolerance, performance and reliability targeted at datapath logic, signal synchronization and memories. Firstly, a domino logic based design style for datapath logic is presented that uses Adaptive Robustness Tuning (ART) in addition to timing speculation to provide up to 71% performance gains over conventional domino logic in a 32b x 32b multiplier in 65nm CMOS. Margins are reduced until functionality errors are detected, and these errors are used to guide the tuning. Secondly, for signal synchronization across clock domains, a new class of dynamic logic based synchronizers with single-cycle synchronization latency is presented, where pulses, rather than stable intermediate voltages, cause metastability. Such pulses are amplified using skewed inverters to improve mean time between failures by ~1e6x over jamb latches and double flip-flops at 2GHz in 65nm CMOS.
Thirdly, a reconfigurable sensing scheme for 6T SRAMs is presented that employs auto-zero calibration and pre-amplification to improve sensing reliability (by up to 1.2 standard deviations of NMOS threshold voltage in 28nm CMOS); this increased reliability is in turn traded for ~42% sensing speedup. Finally, a main memory architecture design methodology to address reliability and power in the context of Exascale computing systems is presented. Based on 3D-stacked DRAMs, the methodology co-optimizes DRAM access energy, refresh power and the increased cost of error resilience, to meet stringent power and reliability constraints.

PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/107238/1/bharan_1.pd

Deep Blue Documents at the University of Michigan

Neural networks-on-chip for hybrid bio-electronic systems

Author: Coapes Graeme
Publication venue: Newcastle University
Publication date: 01/01/2016

PhD Thesis

By modelling the brain's computation we can further our understanding of its function and develop novel treatments for neurological disorders. The brain is incredibly powerful and energy efficient, but its computation does not fit well with the traditional computer architecture developed over the previous 70 years. Therefore, there is growing research focus in developing alternative computing technologies to enhance our neural modelling capability, with the expectation that the technology in itself will also benefit from increased awareness of neural computational paradigms. This thesis focuses upon developing a methodology to study the design of neural computing systems, with an emphasis on studying systems suitable for biomedical experiments. The methodology allows for the design to be optimized according to the application. For example, different case studies highlight how to reduce energy consumption, reduce silicon area, or to increase network throughput. High performance processing cores are presented for both Hodgkin-Huxley and Izhikevich neurons incorporating novel design features. Further, a complete energy/area model for a neural-network-on-chip is derived, which is used in two exemplar case-studies: a cortical neural circuit to benchmark typical system performance, illustrating how a 65,000 neuron network could be processed in real-time within a 100mW power budget; and a scalable high-performance processing platform for a cerebellar neural prosthesis. From these case-studies, the contribution of network granularity towards optimal neural-network-on-chip performance is explored.

Newcastle University eTheses