
    Application specific asynchronous microengines for efficient high-level control

    Technical report. Despite the growing interest in asynchronous circuits, programmable asynchronous controllers based on the idea of microprogramming have not been actively pursued. Since programmable control is widely used in many commercial ASICs (to allow late correction of design errors, to easily upgrade product families under time-to-market pressure, and even to support efficient run-time modifications to control in adaptive systems), we consider it crucial that self-timed techniques support efficient programmable control. This is especially true given that asynchronous (self-timed) circuits are well suited to realizing reactive and control-intensive designs. We offer a practical solution to programmable asynchronous control in the form of application-specific microprogrammed asynchronous controllers (or microengines). The features of our solution include a modular and easily extensible datapath structure, support for the two main styles of handshaking (two-phase and four-phase), and several efficiency measures based on exploiting concurrency between operations and on employing efficient circuit structures. Our results demonstrate that the proposed microengine can yield high performance, in fact performance close to that offered by automated high-level synthesis tools targeting custom hard-wired burst-mode machines.
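The two handshaking styles named in the abstract differ in how many signal transitions each data transfer costs. A minimal Python sketch of the two protocols (illustrative only, not the paper's circuits):

```python
# Toy models of the two handshake styles named in the abstract
# (illustrative only; not the paper's circuit implementation).

def four_phase(transfers):
    """Four-phase (return-to-zero): req and ack both rise, then both
    fall, before the next transfer can start. Four edges per transfer."""
    events = []
    for i in range(transfers):
        events += [(i, "req+"), (i, "ack+"), (i, "req-"), (i, "ack-")]
    return events

def two_phase(transfers):
    """Two-phase (transition signalling): each toggle of req, answered
    by a toggle of ack, completes a transfer. Two edges per transfer."""
    events, req, ack = [], 0, 0
    for i in range(transfers):
        req ^= 1
        events.append((i, f"req->{req}"))
        ack ^= 1
        events.append((i, f"ack->{ack}"))
    return events

# Two-phase needs half as many signal edges per transfer, which is why
# supporting both styles lets a controller trade robustness for speed.
assert len(four_phase(10)) == 40
assert len(two_phase(10)) == 20
```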

    A DRAM-based processing-in-memory microarchitecture for memory-intensive machine learning applications

    Doctoral dissertation -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), February 2022. Advisor: Jung Ho Ahn (์•ˆ์ •ํ˜ธ). Recently, as research on neural networks has gained significant traction, a number of memory-intensive neural network models such as recurrent neural network (RNN) models and recommendation models have been introduced to process various tasks. RNN models and recommendation models spend most of their execution time processing matrix-vector multiplication (MV-mul) and processing embedding layers, respectively. A fundamental primitive of embedding layers, tensor gather-and-reduction (GnR), gathers embedding vectors and then reduces them to a new embedding vector. Because the matrices in RNNs and the embedding tables in recommendation models have poor reusability, and because their ever-increasing sizes make them too large to fit in the on-chip storage of devices, the performance and energy efficiency of MV-mul and GnR are determined by those of main-memory DRAM. Therefore, computing these operations within DRAM draws significant attention. In this dissertation, we first propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul for concurrently processing memory requests from processors while satisfying the limited power budget. Our results show that MViD provides 7.2× higher throughput compared to the baseline system with four DRAM ranks (performing MV-mul in a chip-multiprocessor) while running inference of Deep Speech 2 with a memory-intensive workload. Then we propose TRiM, an NDP architecture for accelerating recommendation systems.
    Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with "in-DRAM" reduction units at the DDR4/5 rank/bank-group/bank level. We modify the interface of DRAM to provide commands effectively to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal TRiM design based on DDR5 achieves up to 7.7× and 3.9× speedups and reduces the energy consumption of embedding-vector gather-and-reduction by 55% and 50% over the baseline and a state-of-the-art NDP architecture, respectively, with a minimal area overhead equivalent to 2.66% of the DRAM chip.
    Contents:
    1 Introduction
        1.1 Accelerating RNNs on Edge
        1.2 Accelerating Recommendation Models
        1.3 Research Contributions
        1.4 Outline
    2 Background
        2.1 Memory-intensive Machine Learning Applications
        2.2 DRAM Organization and Operations
    3 MViD: Sparse Matrix-Vector Multiplication in Mobile DRAM for Accelerating Recurrent Neural Networks
        3.1 Background and Motivation
            3.1.1 Energy-efficient RNN Mobile Inference
            3.1.2 How to Improve the Energy Efficiency and Bandwidth of DRAM Accesses in MV-mul
        3.2 MV-mul in DRAM
            3.2.1 Exploiting Quantization and Sparsity in RNN's Matrix Elements
            3.2.2 The Operation Sequence of MV-mul in DRAM
            3.2.3 Concurrently Serving Requests from Processors and Performing MV-mul in DRAM
            3.2.4 Put It All Together: MViD Architecture
            3.2.5 Additional Optimization Schemes
        3.3 Evaluation
            3.3.1 Power/Area/Timing Analysis
            3.3.2 Performance/Energy Evaluation
        3.4 Discussion
    4 TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory
        4.1 Prior NDP Architectures for Accelerating Tensor Gather-and-Reduction
            4.1.1 Tensor Gather-and-Reduction in RecSys
            4.1.2 Prior NDP Accelerators for GnR
            4.1.3 Quantitative Analysis
            4.1.4 Additional Schemes for Accelerating GnR
        4.2 Tensor Reduction in Memory
            4.2.1 Basic Concept for TRiM
            4.2.2 How to Provision C/A Bandwidth
            4.2.3 Exploring NDP Unit Placement
            4.2.4 TRiM-G Organization and Operations
            4.2.5 Host-side Architecture for TRiM
            4.2.6 Schemes for Improving Reliability
        4.3 Experimental Setup
        4.4 Evaluation
            4.4.1 Performance and Energy Efficiency
            4.4.2 Sensitivity Study of Hot-entry Replication
            4.4.3 Design Overhead
        4.5 Discussion
    5 Discussion
    6 Related Work
    7 Conclusion
    References
    Abstract in Korean (๊ตญ๋ฌธ์ดˆ๋ก)
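The GnR primitive the dissertation accelerates is easy to pin down in software. A reference sketch of its semantics in NumPy (illustrative only; this is what the in-DRAM reduction units compute, not how they compute it):

```python
import numpy as np

def gnr(embedding_table, indices, weights=None):
    """Tensor gather-and-reduction (GnR) as described in the abstract:
    gather the embedding vectors selected by `indices`, then reduce
    (sum) them into a single new embedding vector. Reference semantics
    only, not the in-DRAM implementation."""
    gathered = embedding_table[indices]           # gather step
    if weights is not None:                       # optional per-vector scaling
        gathered = gathered * weights[:, None]
    return gathered.sum(axis=0)                   # reduction step

# 4 embedding vectors of dimension 3, with made-up values
table = np.arange(12, dtype=np.float32).reshape(4, 3)
out = gnr(table, np.array([0, 2]))                # rows [0,1,2] + [6,7,8]
assert np.allclose(out, [6.0, 8.0, 10.0])
```

Because each lookup touches a different, rarely reused set of rows, the gather step is bandwidth-bound in DRAM, which is the motivation for moving the reduction into the memory device.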

    Self-timed circuits using DCVSL semi-bundled delay wrappers

    Journal article. We present a technique for generating robust self-timed completion signals for general dynamic datapath circuits. The wrapper circuit is based on our previous domino semi-bundled delay (SBD) circuits, but uses DCVSL circuits in the wrapper for higher performance. We describe the basic SBD-DCVSL building blocks in the template with respect to their circuit structures and operational behavior. These DCVSL SBD circuits show better performance, exhibit reduced overhead, and require smaller operating margins for the matched delay than the domino version. The DCVSL wrapper can also identify a class of delay faults in the datapath.
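The operating-margin claim can be made concrete with a toy timing check: a bundled (matched) delay is only safe if it exceeds the slowest datapath delay by some margin, so shrinking the required margin directly shrinks the delay line. All numbers below are hypothetical, and this is a first-order model, not the paper's circuit analysis:

```python
def matched_delay_ok(path_delays_ps, matched_delay_ps, margin=1.10):
    """Bundled-data timing check (toy model): the matched delay that
    produces the completion signal must exceed the slowest datapath
    delay by a safety margin. The abstract's point is that the DCVSL
    wrapper lets this margin be smaller than in the domino version.
    All numbers here are hypothetical."""
    return matched_delay_ps >= max(path_delays_ps) * margin

paths = [180, 220, 205]                 # hypothetical datapath delays (ps)
assert matched_delay_ok(paths, 250)     # 250 >= 220 * 1.10 = 242
assert not matched_delay_ok(paths, 240) # 240 <  242: unsafe
```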

    Optimization of DSSS Receivers Using Hardware-in-the-Loop Simulations

    Over the years, there has been significant interest in defining a hardware abstraction layer to facilitate code reuse in software-defined radio (SDR) applications. Designers are looking for a way to enable application software to specify a waveform, configure the platform, and control digital signal processing (DSP) functions in a hardware platform in a way that insulates it from the details of realization. This thesis presents a tool-based methodology for developing and optimizing a direct-sequence spread spectrum (DSSS) transceiver deployed in custom hardware such as field-programmable gate arrays (FPGAs). The system model consists of a transmitter which employs a quadrature phase shift keying (QPSK) modulation scheme, an additive white Gaussian noise (AWGN) channel, and a receiver whose main parts are an analog-to-digital converter (ADC), a digital down converter (DDC), an image-rejection low-pass filter (LPF), a carrier phase-locked loop (PLL), a tracking loop, a down-sampler, spread-spectrum correlators, and a rectangular-to-polar converter. The design methodology is based on a new programming model for FPGAs developed in industry by Xilinx Inc. The Xilinx System Generator for DSP software tool provides design portability and streamlines system development by enabling engineers to create and validate a system model in Xilinx FPGAs. By providing hierarchical modeling and automatic HDL code generation for programmable devices, designs can be easily verified through hardware-in-the-loop (HIL) simulations. HIL provides a significant increase in simulation speed, which allows optimization of the receiver design with respect to the datapath size of the receiver's different functional parts. The parameterized datapath points used in the simulation are ADC resolution, DDC datapath size, LPF datapath size, correlator height, correlator datapath size, and rectangular-to-polar datapath size. These parameters are changed in the software environment and tested for bit error rate (BER) performance through real-time hardware simulations. The final result is a system design with minimum hardware area occupancy for an acceptable BER degradation.
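The sweep-a-parameter-and-measure-BER loop described above can be mimicked in a few lines of software. A minimal QPSK-over-AWGN BER sketch (a floating-point software model; the parameter values are illustrative and not from the thesis):

```python
import numpy as np

# Minimal software model of the QPSK-over-AWGN link: transmit random
# bits, add calibrated noise, hard-decide, and count bit errors.
rng = np.random.default_rng(0)

def qpsk_ber(ebn0_db, nbits=200_000):
    bits = rng.integers(0, 2, nbits)
    # Gray-mapped QPSK: one bit on I, one on Q, amplitude 1 per axis
    i = 1 - 2.0 * bits[0::2]
    q = 1 - 2.0 * bits[1::2]
    ebn0 = 10 ** (ebn0_db / 10)
    sigma = np.sqrt(1 / (2 * ebn0))   # per-dimension noise std for Eb = 1
    ri = i + sigma * rng.standard_normal(i.size)
    rq = q + sigma * rng.standard_normal(q.size)
    errs = np.count_nonzero((ri < 0) != (i < 0)) \
         + np.count_nonzero((rq < 0) != (q < 0))
    return errs / nbits

# BER should fall monotonically as Eb/N0 improves.
bers = [qpsk_ber(x) for x in (0, 4, 8)]
assert bers[0] > bers[1] > bers[2]
```

The thesis's HIL flow plays the same role as this loop, but runs the quantized receiver datapath in the FPGA, so each sweep point reports the BER cost of a given fixed-point width rather than of noise alone.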
    • โ€ฆ
    corecore