80 research outputs found

    Design of a Multi-Level Single-Ended Transmitter for Memory Interfaces

    Thesis (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Suhwan Kim.
    Multi-level transmitters for memory interfaces are presented. The performance gap between processor and memory has grown by about 50% every year, making memory the bottleneck of the overall system. To increase memory bandwidth, we propose a PAM-4 single-ended transmitter, and to compensate for the side effects of multi-rank memory, we propose a reflection-based duobinary transmitter. The driver of the proposed PAM-4 transmitter simultaneously satisfies impedance matching and high linearity, and it occupies a small area thanks to its resistorless and inductorless structure. The proposed ZQ calibration for PAM-4 has three calibration points, which allow the transmitter to have accurate impedance and a linear output, and it accounts for the impedance variation of both the driver and the receiver. A prototype has been fabricated in a 65nm CMOS process, and the transmitter occupies 0.0333 mm². The measured eye at 28 Gb/s has a width of 18.3 ps and a height of 42.4 mV, and the measured energy efficiency is 0.64 pJ/bit. The measured RLM with the 3-point ZQ calibration is 0.993. To increase memory density, stacked-die packaging, in which multiple DRAM dies are stacked vertically in one package, is widely used; combined with the center-pad structure of DRAM, however, it creates stubs that cause short reflections. We propose the reflection-based duobinary transmitter to mitigate this problem: the transmitter exploits the reflection itself to perform duobinary signaling. A 2-tap opposite FFE and slew-rate control are used to improve signal integrity. The measured duobinary eye at 10 Gb/s has a width of 63.6 ps and a height of 70.8 mV, while the NRZ eye shows no opening.
The measured energy efficiency is 1.38 pJ/bit.
    Table of contents: Chapter 1 Introduction (1.1 Motivation; 1.2 Thesis Organization). Chapter 2 Multi-Level Signaling (2.1 PAM-4 Signaling; 2.2 Design Considerations for PAM-4 Transmitter: 2.2.1 Level Separation Mismatch Ratio (RLM), 2.2.2 Impedance Matching, 2.2.3 Prior Arts; 2.3 Duobinary Signaling). Chapter 3 High-Linearity and Impedance-Matched PAM-4 Transmitter (3.1 Overall Architecture; 3.2 Single-Ended Impedance-Matched PAM-4 Driver; 3.3 3-Point ZQ Calibration for PAM-4). Chapter 4 Reflection-Based Duobinary Transmitter (4.1 Bidirectional Dual-Rank Memory System; 4.2 Concept of Reflection-Based Duobinary Signaling; 4.3 Reflection-Based Duobinary Transmitter: 4.3.1 Overall Architecture, 4.3.2 Equalization for Reflection-Based Duobinary Signaling, 4.3.3 2D Binary-Segmented Driver). Chapter 5 Experimental Results (5.1 High-Linearity and Impedance-Matched PAM-4 Transmitter; 5.2 Reflection-Based Duobinary Transmitter). Chapter 6 Conclusion. Bibliography.
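    For reference, the level-separation mismatch ratio (RLM) reported above is commonly defined in PAM-4 specifications (e.g., OIF-CEI and IEEE 802.3) from the four measured signal levels V0 < V1 < V2 < V3; the definition below is added only for context and is not taken from the dissertation:

```latex
\[
  V_{\mathrm{mid}} = \frac{V_0 + V_3}{2}, \qquad
  ES_1 = \frac{V_1 - V_{\mathrm{mid}}}{V_0 - V_{\mathrm{mid}}}, \qquad
  ES_2 = \frac{V_2 - V_{\mathrm{mid}}}{V_3 - V_{\mathrm{mid}}},
\]
\[
  R_{\mathrm{LM}} = \min\left(3\,ES_1,\; 3\,ES_2,\; 2 - 3\,ES_1,\; 2 - 3\,ES_2\right).
\]
```

    Perfectly equidistant levels give ES1 = ES2 = 1/3 and RLM = 1, so the reported 0.993 corresponds to nearly ideal level spacing.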

    CXL Memory as Persistent Memory for Disaggregated HPC: A Practical Approach

    In the landscape of High-Performance Computing (HPC), the quest for efficient and scalable memory solutions remains paramount. The advent of Compute Express Link (CXL) introduces a promising avenue with its potential to function as a Persistent Memory (PMem) solution in the context of disaggregated HPC systems. This paper presents a comprehensive exploration of CXL memory's viability as a candidate for PMem, supported by physical experiments conducted on cutting-edge multi-NUMA nodes equipped with CXL-attached memory prototypes. Our study not only benchmarks the performance of CXL memory but also illustrates the seamless transition from traditional PMem programming models to CXL, reinforcing its practicality. To substantiate our claims, we establish a tangible CXL prototype using an FPGA card embodying CXL 1.1/2.0 compliant endpoint designs (Intel FPGA CXL IP). Performance evaluations, executed through the STREAM and STREAM-PMem benchmarks, showcase CXL memory's ability to mirror PMem characteristics in App-Direct and Memory Mode while achieving impressive bandwidth metrics with Intel 4th generation Xeon (Sapphire Rapids) processors. The results elucidate the feasibility of CXL memory as a persistent memory solution, outperforming previously established benchmarks. In contrast to published DCPMM results, our CXL-DDR4 memory module offers comparable bandwidth to local DDR4 memory configurations, albeit with a moderate decrease in performance. The modified STREAM-PMem application underscores the ease of transitioning programming models from PMem to CXL, thus underscoring the practicality of adopting CXL memory. Comment: 12 pages, 9 figures.
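    For concreteness (this sketch is not from the paper, and the NUMA node number is a hypothetical placeholder that must be looked up on the actual system, e.g. with `numactl -H`), CXL-attached memory that Linux exposes as a CPU-less NUMA node can be used as an ordinary allocation target via libnuma, which is essentially the programming-model shift the paper describes:

```c
/* Minimal sketch: allocate a buffer on a (hypothetical) CXL-backed NUMA node.
 * Build with: cc cxl_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 2   /* hypothetical node id of the CXL-attached memory */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma is not available on this system\n");
        return 1;
    }

    size_t size = 1UL << 30;                       /* 1 GiB test region */
    char *buf = numa_alloc_onnode(size, CXL_NODE); /* bind to CXL node  */
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", CXL_NODE);
        return 1;
    }

    memset(buf, 0, size);   /* touch every page so it is actually placed */
    printf("1 GiB faulted in on NUMA node %d\n", CXL_NODE);

    numa_free(buf, size);
    return 0;
}
```

    A STREAM-style kernel run over such a buffer then exercises the CXL path directly, without any PMem-specific library.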

    PIM-Enclave: Bringing Confidential Computation Inside Memory

    Data-intensive workloads and confidential computing are prominent research directions shaping the future of cloud computing. Computer architectures are evolving to better accommodate computation on large data. Protecting the computation of sensitive data is also an imperative yet challenging objective; processor-supported secure enclaves serve as the key element in confidential computing in the cloud. However, side-channel attacks are threatening their security boundaries. Current processor architectures consume a considerable portion of their cycles moving data. Near-data computation is a promising approach that minimizes redundant data movement by placing computation inside storage. In this paper, we present a novel design for Processing-In-Memory (PIM) as a data-intensive workload accelerator for confidential computing. Based on our observation that moving computation closer to memory can achieve efficiency of computation and confidentiality of the processed information simultaneously, we study the advantages of confidential computing inside memory. We then explain our security model and programming model developed for PIM-based computation offloading. We construct our findings into a software-hardware co-design, which we call PIM-Enclave. Our design illustrates the advantages of PIM-based confidential computing acceleration. Our evaluation shows that PIM-Enclave can provide side-channel-resistant secure computation offloading and run data-intensive applications with negligible performance overhead compared to the baseline PIM model.
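    The offloading flow described above can be pictured roughly as below. Every identifier in this sketch is hypothetical (none of these functions come from the paper or from any real library); the stubs are defined locally only so that the shape of such a programming model is concrete and the file compiles:

```c
/* Hypothetical host-side view of enclave-protected PIM offloading.
 * All pim_enclave_* names are invented stand-ins, stubbed out below. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { int id; } pim_enclave_t;              /* hypothetical handle */

static int pim_enclave_create(pim_enclave_t *e)        /* stub: set up enclave */
{ e->id = 1; return 0; }

static int pim_enclave_attest(const pim_enclave_t *e)  /* stub: attestation */
{ (void)e; return 0; }

static int pim_enclave_offload(const pim_enclave_t *e, const char *kernel,
                               uint8_t *data, size_t len)
{
    /* Stub: a real PIM enclave would run `kernel` next to the data inside
     * memory; here the transform is simply emulated on the host. */
    (void)e; (void)kernel;
    for (size_t i = 0; i < len; i++)
        data[i] ^= 0x5a;
    return 0;
}

int main(void)
{
    pim_enclave_t enc;
    uint8_t secret[16] = "confidential";

    if (pim_enclave_create(&enc) != 0 || pim_enclave_attest(&enc) != 0) {
        fprintf(stderr, "enclave setup failed\n");
        return 1;
    }
    pim_enclave_offload(&enc, "xor_transform", secret, sizeof secret);
    printf("offloaded kernel processed %zu bytes\n", sizeof secret);
    return 0;
}
```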

    Collaborative Acceleration for FFT on Commercial Processing-In-Memory Architectures

    This paper evaluates the efficacy of recent commercial processing-in-memory (PIM) solutions to accelerate fast Fourier transform (FFT), an important primitive across several domains. Specifically, we observe that efficient implementations of FFT on modern GPUs are memory bandwidth bound. As such, the memory bandwidth boost availed by commercial PIM solutions makes a case for PIM to accelerate FFT. To this end, we first deduce a mapping of FFT computation to a strawman PIM architecture representative of recent commercial designs. We observe that even with careful data mapping, PIM is not effective in accelerating FFT. To address this, we make a case for collaborative acceleration of FFT with PIM and GPU. Further, we propose software and hardware innovations which lower the PIM operations necessary for a given FFT. Overall, our optimized PIM FFT mapping, termed Pimacolaba, delivers performance and data movement savings of up to 1.38× and 2.76×, respectively, over a range of FFT sizes.
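    For readers who have not seen why FFT is bandwidth-hungry, the minimal iterative radix-2 Cooley-Tukey sketch below (plain C, not the paper's Pimacolaba mapping) makes the access pattern visible: a bit-reversal shuffle scatters elements across the whole array, and the butterfly stride doubles every stage, so later stages stream widely separated elements through the memory system:

```c
/* In-place radix-2 Cooley-Tukey FFT; n must be a power of two.
 * Build with: cc fft.c -lm */
#include <complex.h>
#include <math.h>
#include <stddef.h>

void fft(double complex *x, size_t n)
{
    /* Bit-reversal permutation: scattered reads/writes over the whole array. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j |= bit;
        if (i < j) {
            double complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }

    /* log2(n) butterfly stages; the stride len/2 doubles each stage, so the
     * later stages touch elements far apart and stress memory bandwidth. */
    for (size_t len = 2; len <= n; len <<= 1) {
        double complex wlen = cexp(-2.0 * acos(-1.0) * I / (double)len);
        for (size_t i = 0; i < n; i += len) {
            double complex w = 1.0;
            for (size_t k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = x[i + k + len / 2] * w;
                x[i + k]           = u + v;
                x[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```

    Each butterfly performs only one complex multiply and two additions per pair of loaded elements; this low arithmetic intensity is what makes the extra bandwidth of PIM attractive.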

    HPC memory systems: Implications of system simulation and checkpointing

    The memory system is a significant contributor to most of the current challenges in computer architecture: application performance bottlenecks and operational costs in large data centers such as HPC supercomputers. With the advent of emerging memory technologies, the exploration of novel designs for the memory hierarchy of HPC systems is an open invitation for computer architecture researchers to improve and optimize current designs and deployments. System simulation is the preferred approach for architectural exploration due to the low cost compared with prototyping hardware systems, acceptable performance estimates, and accurate energy consumption predictions. Despite the broad presence and extensive usage of system simulators, their validation is not standardized, either because the main purpose of the simulator is not to mimic real hardware, or because the design assumptions are too narrow for a particular computer architecture topic. This thesis provides the first steps toward a systematic methodology for validating system simulators against real systems. We unveil the real machine's micro-architectural parameters through a set of specially crafted micro-benchmarks. The unveiled parameters are used to upgrade the simulation infrastructure in order to obtain higher accuracy in the simulation domain. To evaluate accuracy in the simulation domain, we propose the retirement factor, an extension to a well-known application performance methodology. Our proposal provides a new metric to measure the impact of simulator parameter tuning when looking for the most accurate configuration. We further present the delay queue, a modification to the memory controller that imposes a configurable delay on all memory transactions that reach the main memory devices; evaluated using the retirement factor, the delay queue allows us to identify the sources of deviation between the simulation infrastructure and the real system. Memory accesses directly affect application performance, both on the real machine and in simulation accuracy. From single reads of one memory location to simultaneous read/write operations on one or multiple memory locations, HPC applications' memory usage differs from workload to workload. A property that offers a glimpse of an application's memory usage is the workload's memory footprint. In this work, we find a link between HPC workloads' memory footprints and simulation performance. Current trends in HPC data-center memory deployments and in HPC applications' memory footprints led us to envision an opportunity for emerging memory technologies to become part of the reliability support of HPC systems. Emerging memory technologies such as 3D-stacked DRAM are being deployed in current HPC systems, but in limited quantities compared with standard DRAM, making them suitable for low-memory-footprint HPC applications. We exploit and evaluate this characteristic by enabling a Checkpoint-Restart library to support a heterogeneous memory system deployed with an emerging memory technology. Our implementation imposes negligible overhead while offering a simple interface to allocate, manage, and migrate data sets between heterogeneous memory systems. Moreover, we show that using an emerging memory technology is not by itself a direct solution to performance bottlenecks; correct data placement and carefully crafted code are critical to obtaining the best computing performance.
Overall, this thesis provides a technique for validating main memory system simulators when integrated into a simulation infrastructure and compared against real systems. In addition, we explore a link between the workload's memory footprint and simulation performance on current HPC workloads. Finally, we enable low-memory-footprint HPC applications with resilience support while transparently profiting from emerging memory deployments.
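    As a rough illustration of the heterogeneous-memory interface idea above (this uses the generic memkind library rather than the thesis's actual Checkpoint-Restart extension, and the sizes are arbitrary), a small checkpoint buffer can be preferentially placed in high-bandwidth 3D-stacked memory and transparently fall back to ordinary DRAM:

```c
/* Sketch: place a checkpoint buffer in HBM if present, else DRAM.
 * Build with: cc ckpt.c -lmemkind */
#include <memkind.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t ckpt_size = 64UL << 20;   /* 64 MiB checkpoint buffer (arbitrary) */

    /* MEMKIND_HBW_PREFERRED asks for high-bandwidth memory but may fall back
     * to regular DRAM when no such nodes exist on the machine. */
    void *ckpt = memkind_malloc(MEMKIND_HBW_PREFERRED, ckpt_size);
    if (!ckpt)
        ckpt = memkind_malloc(MEMKIND_DEFAULT, ckpt_size);
    if (!ckpt) {
        fprintf(stderr, "checkpoint buffer allocation failed\n");
        return 1;
    }

    memset(ckpt, 0, ckpt_size);      /* stand-in for serializing app state */
    printf("checkpoint buffer of %zu bytes ready\n", ckpt_size);

    memkind_free(NULL, ckpt);        /* NULL kind: memkind detects the kind */
    return 0;
}
```

    The thesis's point is that such placement decisions, not the mere presence of the faster memory, determine whether the application actually benefits.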

    Architectural Techniques to Enable Reliable and Scalable Memory Systems

    High capacity and scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults. This is because we rely on technology scaling to improve memory density, and at small feature sizes, memory cells tend to break easily. Today, memory reliability is seen as the key impediment towards using high-density devices, adopting new technologies, and even building the next Exascale supercomputer. To ensure even a bare-minimum level of reliability, present-day solutions tend to have high performance, power, and area overheads. Ideally, we would like memory systems to remain robust, scalable, and implementable while keeping the overheads to a minimum. This dissertation describes how simple cross-layer architectural techniques can provide orders of magnitude higher reliability and enable seamless scalability for memory systems while incurring negligible overheads. Comment: PhD thesis, Georgia Institute of Technology (May 2017).