
    A distributed interleaving scheme for efficient access to WideIO DRAM memory

    Achieving the required main memory (DRAM) bandwidth at acceptable power levels for current and future applications is a major challenge for System-on-Chip designers targeting mobile platforms. Three-dimensional (3D) integration and 3D-stacked DRAM memories promise a significant boost in bandwidth at low power levels by exploiting multiple channels and wide data interfaces. In this paper, we address the problem of efficiently exploiting the multiple channels provided by standard (JEDEC's WideIO) 3D-stacked memories, to extract maximal effective bandwidth and minimize latency for main memory access. We propose a new distributed interleaved access method that leverages the on-chip interconnect to simplify the design and implementation of the DRAM controller, without impacting performance compared to traditional centralized implementations. We perform experiments on realistic workloads for a mobile communication and multimedia platform and show that our proposed distributed interleaving memory access method improves overall throughput while minimally impacting the performance of latency-sensitive communication flows.
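    The core of any such scheme is the address-to-channel mapping. The sketch below illustrates the general idea of interleaving consecutive address blocks across WideIO channels; the channel count, interleaving granularity, and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of multi-channel address interleaving, assuming a
# 4-channel WideIO-style stack and a 2 KiB interleaving granularity.
# Parameters and names are illustrative, not taken from the paper.

NUM_CHANNELS = 4          # WideIO provides multiple independent channels
INTERLEAVE_BYTES = 2048   # contiguous bytes mapped to one channel at a time

def map_address(phys_addr: int) -> tuple[int, int]:
    """Return (channel, address within channel) for a physical address."""
    block = phys_addr // INTERLEAVE_BYTES          # which interleaving block
    channel = block % NUM_CHANNELS                 # round-robin over channels
    intra = phys_addr % INTERLEAVE_BYTES           # offset inside the block
    local = (block // NUM_CHANNELS) * INTERLEAVE_BYTES + intra
    return channel, local

# Consecutive blocks land on different channels, so a streaming access
# pattern keeps all channels busy:
for addr in range(0, 8 * INTERLEAVE_BYTES, INTERLEAVE_BYTES):
    print(hex(addr), "->", map_address(addr))
```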

    Temperature Variation Aware Energy Optimization in Heterogeneous MPSoCs

    Thermal effects are rapidly gaining importance in nanometer heterogeneous integrated systems. Increased power density, coupled with the spatio-temporal variability of chip workload, causes lateral and vertical temperature non-uniformities (variations) in the chip structure. The assumption of a uniform temperature for a large circuit leads to inaccurate determination of key design parameters. To improve design quality, we need precise estimation of temperature at a detailed spatial resolution, which is very computationally intensive. Consequently, thermal analysis of designs needs to be done at multiple levels of granularity. To further investigate the flow of chip/package thermal analysis, we exploit the Intel Single Chip Cloud Computer (SCC) and propose a methodology for calibrating the SCC on-die temperature sensors. We also develop an infrastructure for online monitoring of SCC temperature sensor readings and SCC power consumption. Having the thermal simulation tool in hand, we propose MiMAPT, an approach for analyzing delay, power and temperature in digital integrated circuits. MiMAPT integrates seamlessly into industrial front-end and back-end chip design flows. It accounts for temperature non-uniformities and self-heating while performing analysis. Furthermore, we extend the temperature-variation-aware analysis of designs to 3D MPSoCs with Wide-I/O DRAM. We reduce the DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. We develop an advanced virtual platform which models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. Moving towards real-world multi-core heterogeneous SoC designs, a reconfigurable heterogeneous platform (ZYNQ) is exploited to further study the performance and energy efficiency of various CPU-accelerator data sharing methods in heterogeneous hardware architectures. A complete hardware accelerator featuring clusters of OpenRISC CPUs, with dynamic address remapping capability, is built and verified on real hardware.
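    As a rough illustration of the per-bank refresh adaptation mentioned above, the following sketch picks a refresh period from a bank's estimated temperature. The 85 °C threshold and the halve-per-10 °C rule are typical JEDEC-style guidelines used here as assumptions; the actual model in the thesis may differ.

```python
# Minimal sketch of temperature-aware per-bank refresh, assuming the common
# rule of thumb that DRAM retention roughly halves for every additional
# ~10 degC above the normal operating range. Thresholds and period values
# are illustrative, not taken from the thesis.

BASE_REFRESH_MS = 64.0   # nominal refresh window at or below 85 degC

def refresh_period_ms(bank_temp_c: float) -> float:
    """Pick a refresh period for one bank based on its temperature."""
    if bank_temp_c <= 85.0:
        return BASE_REFRESH_MS            # normal range: full 64 ms window
    # Above 85 degC, halve the period for every started 10 degC step.
    steps = int((bank_temp_c - 85.0) // 10.0) + 1
    return BASE_REFRESH_MS / (2 ** steps)

# Hotter banks (e.g., stacked directly above a busy core) refresh more often;
# cooler banks keep the relaxed period and save refresh energy.
for temp in (70.0, 88.0, 97.0):
    print(f"{temp:5.1f} degC -> {refresh_period_ms(temp):5.1f} ms")
```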

    Designing Low Cost Error Correction Schemes for Improving Memory Reliability

    Memory systems are becoming increasingly error-prone, and thus guaranteeing their reliability is a major challenge. In this dissertation, new techniques to improve the reliability of both 2D and 3D dynamic random access memory (DRAM) systems are presented. The proposed schemes have higher reliability than current systems but with lower power, better performance and lower hardware cost. First, a low-overhead solution that improves the reliability of commodity DRAM systems with no change in the existing memory architecture is presented. Specifically, five erasure and error correction (E-ECC) schemes are proposed that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. In addition, the use of erasure codes extends the lifetime of the 2D DRAM systems. Next, two error correction schemes are presented for 3D DRAM memory systems. The first scheme is a rate-adaptive, two-tiered error correction scheme (RATT-ECC) that provides strong reliability (a 10^10x reduction in raw FIT rate) for an HBM-like 3D DRAM system that services CPU applications. The rate-adaptive feature of RATT-ECC enables permanent bank failures to be handled through sparing. It can also be used to significantly reduce the refresh power consumption without decreasing reliability or timing performance. The second scheme is a two-tiered error correction scheme (Config-ECC) that supports different-sized accesses in GPU applications with strong reliability. It addresses the mismatch between data access size and fixed-size ECC schemes through a flexible, product-code-based design. Config-ECC is built around a core unit designed for 32B accesses, with a simple extension to support 64B and 128B accesses. Compared to fixed 32B and 64B ECC schemes, Config-ECC reduces the failure-in-time (FIT) rate by 200x and 20x, respectively. It also reduces the memory energy by 17% (in the dynamic mode) and 21% (in the static mode) compared to a state-of-the-art fixed 64B ECC scheme.
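    To make the erasure idea concrete, here is a minimal sketch of Chipkill-style recovery when the failed chip is already known (an erasure): a single parity symbol per codeword is enough to rebuild the missing symbol. This is a simplified stand-in, not the dissertation's actual E-ECC construction; symbol sizes and names are illustrative.

```python
# Minimal sketch of erasure-based Chipkill-style recovery, assuming a
# RAID-like organization: one parity symbol per codeword across DRAM chips,
# and a chip already known to be faulty treated as an erasure.

from functools import reduce

def parity_symbol(symbols: list[int]) -> int:
    """XOR parity over the data symbols of one codeword."""
    return reduce(lambda a, b: a ^ b, symbols, 0)

def recover_erasure(symbols: list[int], parity: int, erased_idx: int) -> int:
    """Rebuild the symbol from a chip marked as erased (position known)."""
    acc = parity
    for i, s in enumerate(symbols):
        if i != erased_idx:
            acc ^= s
    return acc

data = [0x3A, 0x7F, 0x01, 0xC8]      # symbols from four DRAM chips
p = parity_symbol(data)

# Chip 2 is known-bad (erasure), so its contribution can be reconstructed
# from the surviving chips plus parity -- no error-location decoding needed.
assert recover_erasure(data, p, erased_idx=2) == data[2]
```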

    Controller PHY Design for High-Capacity DRAM with a Pulse-Based Feed-Forward Equalizer

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Information Engineering, College of Engineering, August 2020 (advisor: Suhwan Kim). A controller PHY for a managed DRAM solution, a new memory structure that maximizes capacity while minimizing refresh power, is presented. Inter-symbol interference (ISI) is critical in such a high-capacity DRAM interface, in which many DRAM chips share a command/address (C/A) channel. A pulse-based feed-forward equalizer (PB-FFE) is introduced to reduce ISI on the C/A channel. The controller PHY supports all the training sequences specified in the DDR4 standard. A glitch-free digitally controlled delay line (DCDL) is also adopted to perform link training efficiently and to reduce training time. The DQ transmitter adopts a quarter-rate architecture to reduce output latency. In a quarter-rate transmitter, phase errors between the quadrature clocks degrade the integrity of the output signal. For the quarter-rate DQ transmitters, we therefore propose a quadrature error corrector (QEC), in which clock phase errors are corrected using two replicas of the 4:1 serializer of the output stage. Pulse shrinking is used to compare and equalize the outputs of these two replica serializers. The controller PHY was fabricated in 55nm CMOS. The PB-FFE increases the C/A timing margin from 0.23UI to 0.29UI at 1067Mbps. At 2133Mbps, the read timing and voltage margins are 0.53UI and 211mV after read training, and the write margins are 0.72UI and 230mV after write training. To validate the effectiveness of the QEC, a prototype quarter-rate transmitter including the QEC was fabricated on a separate chip in 65nm CMOS. With the QEC, the experimental results show that the output phase errors of the transmitter are reduced to a residual error of 0.8ps, and the output eye width and height are improved by 84% and 61%, respectively, at a data rate of 12.8Gbps.
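    The equalization principle behind the PB-FFE can be summarized with a generic feed-forward equalizer: the transmitter subtracts a scaled copy of the previous symbol so that inter-symbol interference on the heavily loaded C/A channel is cancelled at the receiver. The sketch below shows this FIR-style pre-emphasis only; the tap count and weights are assumptions, and the pulse-based circuit realization described in the thesis is not modeled.

```python
# Minimal sketch of feed-forward equalization on a heavily loaded channel,
# assuming a generic 2-tap FFE (main cursor + one negative post-cursor tap).

def ffe(symbols: list[float], taps: list[float]) -> list[float]:
    """Apply FIR pre-emphasis: y[n] = sum_k taps[k] * x[n-k]."""
    out = []
    for n in range(len(symbols)):
        y = 0.0
        for k, c in enumerate(taps):
            if n - k >= 0:
                y += c * symbols[n - k]
        out.append(y)
    return out

bits = [1.0, 1.0, -1.0, -1.0, 1.0, -1.0]   # transmitted C/A symbols (+/-1)
taps = [1.0, -0.25]                        # de-emphasize the previous symbol

print(ffe(bits, taps))
# Transitions are boosted relative to repeated symbols, which counteracts
# the low-pass behaviour of the shared, heavily loaded channel.
```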

    An Energy-Efficient Reconfigurable Mobile Memory Interface for Computing Systems

    The need for higher power efficiency and higher bandwidth in transceiver design has increased significantly as mobile devices such as smartphones, laptops, tablets, and ultra-portable personal digital assistants continue to be constructed from heterogeneous intellectual property blocks such as central processing units (CPUs), graphics processing units (GPUs), digital signal processors, dynamic random-access memories (DRAMs), sensors, and graphics/image processing units, and to gain enhanced graphics computing and video processing capabilities. However, the current mobile interface technologies which support CPU-to-memory communication (e.g., baseband-only signaling) have critical limitations, particularly super-linear energy consumption, limited bandwidth, and non-reconfigurable data access. As a consequence, there is a critical need to improve both energy efficiency and bandwidth for future mobile devices. The primary goal of this study is to design an energy-efficient reconfigurable mobile memory interface for mobile computing systems in order to dramatically enhance circuit- and system-level bandwidth and power efficiency. The proposed energy-efficient mobile memory interface, which utilizes advanced base-band (BB) signaling and RF-band signaling, is capable of simultaneous bi-directional communication and reconfigurable data access. It also increases power efficiency and bandwidth between mobile CPUs and memory subsystems on a single-ended shared transmission line. Moreover, because multiple data streams share a single-ended transmission line, the number of transmission lines between the mobile CPU and memories is considerably reduced, resulting in significant technological benefits (e.g., more compact devices and low-cost packaging for the mobile communication interface) and establishing the principles and feasibility of technologies for future mobile system applications. The operation and performance of the proposed transceiver are analyzed and its circuit implementation is discussed in detail. A chip prototype of the transceiver was implemented in a 65nm CMOS process technology. In measurements, the transceiver exhibits higher aggregate data throughput and better energy efficiency compared to prior work.
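    The simultaneous bi-directional operation mentioned above relies on each end cancelling its own contribution to the shared line. A minimal sketch of that idea follows, with purely illustrative values and names; the BB/RF-band circuit details of the dissertation are not modeled.

```python
# Minimal sketch of simultaneous bi-directional signaling on one shared line:
# the line carries the superposition of both drivers, and each end recovers
# the far-end signal by subtracting a replica of its own transmitted signal.

def line_voltage(tx_cpu: float, tx_mem: float) -> float:
    """Shared line carries the sum of both transmitted signals."""
    return tx_cpu + tx_mem

def receive(line: float, own_tx_replica: float) -> float:
    """Each end cancels its own contribution to see the other side's data."""
    return line - own_tx_replica

# CPU sends +1 while memory simultaneously sends -1 on the same wire.
v = line_voltage(+1.0, -1.0)
print(receive(v, +1.0))   # CPU side recovers -1.0 (memory's symbol)
print(receive(v, -1.0))   # memory side recovers +1.0 (CPU's symbol)
```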

    Refresh Triggered Computation: Improving the Energy Efficiency of Convolutional Neural Network Accelerators

    To employ a Convolutional Neural Network (CNN) in an energy-constrained embedded system, it is critical for the CNN implementation to be highly energy efficient. Many recent studies propose CNN accelerator architectures with custom computation units that try to improve the energy efficiency and performance of CNNs by minimizing data transfers from DRAM-based main memory. However, in these architectures, DRAM is still responsible for half of the overall energy consumption of the system, on average. A key factor in the high energy consumption of DRAM is the refresh overhead, which is estimated to consume 40% of the total DRAM energy. In this paper, we propose a new mechanism, Refresh Triggered Computation (RTC), that exploits the memory access patterns of CNN applications to reduce the number of refresh operations. We propose three RTC designs (min-RTC, mid-RTC, and full-RTC), each of which requires a different level of aggressiveness in terms of customization to the DRAM subsystem. All of our designs have small overhead. Even the most aggressive RTC design (i.e., full-RTC) imposes an area overhead of only 0.18% in a 16 Gb DRAM chip, and can have less overhead for denser chips. Our experimental evaluation on six well-known CNNs shows that RTC reduces average DRAM energy consumption by 24.4% and 61.3% for the least aggressive and the most aggressive RTC implementations, respectively. Besides CNNs, we also evaluate our RTC mechanism on three workloads from other domains. We show that RTC saves 31.9% and 16.9% DRAM energy for Face Recognition and Bayesian Confidence Propagation Neural Network (BCPNN), respectively. We believe RTC can be applied to other applications whose memory access patterns remain predictable for a sufficiently long time.
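    A minimal sketch of the underlying idea, assuming a simple row-level bookkeeping model: a row that the CNN's data flow has touched within the retention window is implicitly refreshed by that access, so its next scheduled refresh can be skipped. The retention value and data structures are illustrative, not the paper's actual RTC designs.

```python
# Minimal sketch of access-pattern-driven refresh skipping.

RETENTION_MS = 64.0                    # assume data is safe this long after an access

last_touch_ms: dict[int, float] = {}   # row id -> time of last access

def on_access(row: int, now_ms: float) -> None:
    """Record that a CNN read/write just activated this row."""
    last_touch_ms[row] = now_ms

def needs_refresh(row: int, now_ms: float) -> bool:
    """Refresh only rows whose last activation is older than the window."""
    return now_ms - last_touch_ms.get(row, -RETENTION_MS) >= RETENTION_MS

on_access(row=7, now_ms=100.0)
print(needs_refresh(7, now_ms=130.0))   # False: touched 30 ms ago, skip refresh
print(needs_refresh(7, now_ms=200.0))   # True: 100 ms since last access
print(needs_refresh(3, now_ms=130.0))   # True: never accessed in this window
```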

    Scalable and bandwidth-efficient memory subsystem design for real-time systems


    Memory systems for high-performance computing: the capacity and reliability implications

    Memory systems are significant contributors to the overall power requirements, energy consumption, and operational cost of large high-performance computing (HPC) systems. Limitations of main memory systems in terms of latency, bandwidth and capacity can significantly affect the performance of HPC applications and can have a strong negative impact on system scalability. In addition, errors in the main memory system can have a strong impact on the reliability, accessibility and serviceability of large-scale clusters. This thesis studies capacity and reliability issues in modern memory systems for high-performance computing. The choice of main memory capacity is an important aspect of HPC memory system design. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional DIMMs, 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. We analyze the memory capacity requirements of important HPC benchmarks and applications. The study identifies the HPC applications and use cases with memory footprints that could be served by 3D-stacked memory chiplets, making a first step towards the adoption of this novel technology in the HPC domain. For HPC domains where large memory capacities are required, we propose scaling-in of HPC applications to reduce energy consumption and the running time of a batch of jobs. We also propose upgrading the per-node memory capacity, which enables a greater degree of scaling-in and additional energy savings. The memory system is one of the main causes of hardware failures. In each generation, the DRAM chip density and the amount of memory in systems increase, while the DRAM technology process constantly shrinks. Therefore, we can expect DRAM failures to have a serious impact on the reliability of future systems. This thesis studies DRAM errors observed on a production HPC system during a period of two years. We clearly distinguish between two different approaches to DRAM error analysis: categorical analysis and the analysis of error rates. The first approach compares the errors at the DIMM level and partitions the DIMMs into various categories, e.g. based on whether they did or did not experience an error. The second approach is to analyze the error rates, i.e., to present the total number of errors relative to other statistics, typically the number of MB-hours or the duration of the observation period. We show that although DRAM error analysis may be performed with both approaches, they are not interchangeable and can lead to completely different conclusions. We further demonstrate the importance of providing statistical significance and presenting results that have practical value and real-life use. We show that various widely-accepted approaches to DRAM error analysis may provide data that appear to support an interesting conclusion, but are not statistically significant, meaning that they could merely be the result of chance. We hope the study of methods for DRAM error analysis presented in this thesis will become a standard for any future analysis of DRAM errors in the field.
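    The distinction between the two analysis approaches can be made concrete with a small sketch: one function computes the categorical view (fraction of DIMMs that ever saw an error) and another the error-rate view (errors per MB-hour). The populations and numbers below are invented purely to show how the two views can rank the same data differently.

```python
# Minimal sketch contrasting categorical DRAM error analysis with
# error-rate analysis. All numbers are made up for illustration only.

def categorical_fraction(errors_per_dimm: list[int]) -> float:
    """Share of DIMMs that experienced at least one error."""
    return sum(1 for e in errors_per_dimm if e > 0) / len(errors_per_dimm)

def error_rate_per_mb_hour(errors_per_dimm: list[int],
                           mb_per_dimm: float, hours: float) -> float:
    """Total errors relative to the total observed MB-hours."""
    return sum(errors_per_dimm) / (len(errors_per_dimm) * mb_per_dimm * hours)

# Two hypothetical DIMM populations observed for the same two-year period:
gen_a = [0, 0, 1, 0, 1, 0, 0, 0]        # many DIMMs affected, few errors each
gen_b = [0, 0, 0, 0, 0, 0, 0, 5000]     # one DIMM with a massive error count

print(categorical_fraction(gen_a), categorical_fraction(gen_b))   # 0.25 vs 0.125
print(error_rate_per_mb_hour(gen_a, 16384.0, 17520.0),
      error_rate_per_mb_hour(gen_b, 16384.0, 17520.0))
# The categorical view favours gen_b (fewer DIMMs affected), while the rate
# view favours gen_a (far fewer total errors) -- the approaches are not
# interchangeable.
```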