
929 research outputs found

Nonphotolithographic nanoscale memory density prospects

Author: DeHon André
Goldstein Seth Copen
Kuekes Philip J.
Publication date: 01/03/2005

Technologies are now emerging to construct molecular-scale electronic wires and switches using bottom-up self-assembly. This opens the possibility of constructing nanoscale circuits and memories where active devices are just a few nanometers square and wire pitches may be on the order of ten nanometers. The features can be defined at this scale without using photolithography. The available assembly techniques have relatively high defect rates compared to conventional lithographic integrated circuits and can only produce very regular structures. Nonetheless, with proper memory organization, it is reasonable to expect these technologies to provide memory densities in excess of 10^11 b/cm^2 with modest active power requirements under 0.6 W/Tb/s for random read operations.
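As a rough sanity check on the figures in this abstract, the arithmetic below assumes one bit per wire crosspoint and a 10 nm pitch in both directions (an illustrative assumption; the abstract only says "on the order of ten nanometers"). It is a back-of-the-envelope sketch, not the authors' density model.

```latex
% Raw crosspoint density at an assumed 10 nm pitch (1 nm = 10^{-7} cm)
\[
(10\,\mathrm{nm})^2 = (10^{-6}\,\mathrm{cm})^2 = 10^{-12}\,\mathrm{cm}^2
\quad\Rightarrow\quad
\frac{1}{10^{-12}\,\mathrm{cm}^2} = 10^{12}\ \text{crosspoints/cm}^2
\]
% Power per unit bandwidth is the same thing as energy per bit
\[
0.6\,\frac{\mathrm{W}}{\mathrm{Tb/s}}
= 0.6\,\frac{\mathrm{J}}{\mathrm{Tb}}
= 0.6\times 10^{-12}\,\frac{\mathrm{J}}{\mathrm{b}}
= 0.6\ \mathrm{pJ/bit}
\]
```

Under that assumption the quoted density sits about one order of magnitude below the raw crosspoint bound, and the power figure corresponds to at most 0.6 pJ per randomly read bit.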

Caltech Authors

Scaling of a Fast Fourier Transform and a pseudo-spectral fluid solver up to 196608 cores

Author: Chatterjee Anando G.
Hadri Bilel
Khurram Rooh
Kumar Abhishek
Samtaney Ravi
Verma Mahendra K.
Publication venue: 'Elsevier BV'
Publication date: 01/03/2018

In this paper we present scaling results of a FFT library, FFTK, and a pseudospectral code, Tarang, on grid resolutions up to $8192^3$ grid using 65536 cores of Blue Gene/P and 196608 cores of Cray XC40 supercomputers. We observe that communication dominates computation, more so on the Cray XC40. The computation time scales as $T_\mathrm{comp} \sim p^{-1}$, and the communication time as $T_\mathrm{comm} \sim n^{-\gamma_2}$ with $\gamma_2$ ranging from 0.7 to 0.9 for Blue Gene/P, and from 0.43 to 0.73 for Cray XC40. FFTK, and the fluid and convection solvers of Tarang exhibit weak as well as strong scaling nearly up to 196608 cores of Cray XC40. We perform a comparative study of the performance on the Blue Gene/P and Cray XC40 clusters.
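Scaling exponents of this kind are typically read off a log-log fit of time against core or node count. The sketch below shows one way to do that in Python; the helper `fit_exponent` is hypothetical and the timings are synthetic values generated for illustration, not measurements from the paper.

```python
import math

def fit_exponent(xs, ts):
    """Least-squares slope of log(t) against log(x).

    Returns gamma such that t ~ x**(-gamma).
    """
    lx = [math.log(x) for x in xs]
    lt = [math.log(t) for t in ts]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_t = sum(lt) / n
    cov = sum((a - mean_x) * (b - mean_t) for a, b in zip(lx, lt))
    var = sum((a - mean_x) ** 2 for a in lx)
    return -cov / var

# Synthetic timings (illustrative only, not from the paper).
counts = [1024, 2048, 4096, 8192]
t_comp = [100.0 / c for c in counts]          # ideal scaling, exponent 1
t_comm = [50.0 * c ** -0.7 for c in counts]   # sub-linear scaling, exponent 0.7

gamma_comp = fit_exponent(counts, t_comp)
gamma_comm = fit_exponent(counts, t_comm)
print(round(gamma_comp, 3), round(gamma_comm, 3))
```

An exponent below 1 for communication means that doubling the resources cuts communication time by less than half, which is consistent with the abstract's observation that communication dominates at scale.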

arXiv.org e-Print Archive

Crossref

Coventry University Pure Portal

High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP

Author: Akoglu Ali
Benkrid Khaled
Ling Cheng
Liu Ying
Song Yang
Tian Xiang
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2012

This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high-performance, efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on the performance-per-watt criterion and perform better than all other platforms on the performance-per-dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both the performance-per-watt and performance-per-dollar criteria. In general, in order to outperform other technologies on the performance-per-dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.
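For readers unfamiliar with the algorithm being accelerated, here is a minimal software sketch of the Smith-Waterman local alignment score in Python. The scoring parameters (`match=2`, `mismatch=-1`, a linear `gap=-1`) are assumptions chosen for illustration; the abstract does not state the scoring scheme used in the study.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

identical = smith_waterman_score("ACGT", "ACGT")
disjoint = smith_waterman_score("AAAA", "TTTT")
print(identical, disjoint)
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal can be computed independently. That property is what makes the recurrence a natural fit for parallel hardware.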

Crossref

Directory of Open Access Journals

Edinburgh Research Explorer

A DRAM-Based Neural Network Accelerator Architecture for Binary Neural Networks

Author: 최해랑
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2021

Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. 유승주.

In convolutional neural network (CNN) applications, most computation comes from the multiply-and-accumulate operations of the convolution and fully-connected layers. From the hardware perspective (i.e., in the gate-level circuits), these operations are performed as many dot products between feature-map and kernel vectors. Since the feature map and kernel are in matrix form, the vectors converted from 3D or 4D matrices are reused many times in the matrix multiplications. As DNN throughput increases, the power consumption and performance bottleneck caused by data movement become more critical. More importantly, power consumed by off-chip memory accesses dominates total power, since an off-chip memory access consumes several hundred times more power than the computation. Accelerator throughput ranges from several hundred GOPS to several TOPS, but memory bandwidth is less than 25.6 or 34 GB/s (with DDR4 or LPDDR4). Reducing the network size and/or the amount of data moved eases both the data-movement power problem and the performance bottleneck. Among algorithmic approaches, quantization is widely used. Binary Neural Networks (BNNs) reduce precision all the way down to 1 bit. Their accuracy is much lower than that of FP16, but it is continuously improving through various studies. On the data-flow side, redundant data movement can be reduced by increasing data reuse. These two methods are widely applied in accelerators because they need no additional computation during inference. In this dissertation, I present 1) a DRAM-based accelerator architecture and 2) a DRAM refresh method that mitigates the performance loss caused by DRAM refresh. The two methods are orthogonal, so they can be integrated into a single DRAM chip and operate independently.

First, we propose a DRAM-based accelerator architecture capable of massive, large-vector dot-product operations. In the field of CNN accelerators to which BNNs can be applied, computing-in-memory (CIM) structures that use the cell-array structure of memory for vector dot products are being actively studied. Since DRAM stores all the neural network data, it is well placed to reduce the amount of data transfer. The proposed architecture operates using the basic operations of DRAM. The second method reduces the performance degradation and power consumption caused by DRAM refresh. Since DRAM cannot read or write data while performing a periodic refresh, system performance decreases. The proposed refresh method tests the refresh characteristics inside the DRAM chip during self-refresh and lengthens the refresh period according to those characteristics. Since it operates independently inside the DRAM, it can be applied to any system using DRAM, including deep neural network accelerators. We surveyed system integration with a software stack so that the in-DRAM accelerator can be used from a DL framework. As a result, we expect that the in-DRAM accelerator can be controlled with the memory-controller implementation verified in the earlier experiment. We have also added a performance simulation function for the in-DRAM accelerator to PyTorch. When a neural network runs in PyTorch, it reports the computation latency and data movement latency of the layers running on the in-DRAM accelerator. Being able to predict hardware performance while co-designing the network is a significant advantage.

Korean abstract (translated): In convolutional neural network (CNN) applications, most operations are the multiplications and accumulations that occur in the convolution and fully-connected layers. At the gate-logic level, they are executed as a large number of vector dot products that repeatedly use the input and kernel vectors. For deep neural network computation, many small compute units capable of simple operations are more suitable than general-purpose compute units. Once accelerator performance rises above a certain level, it is limited by the data transfer that the computation requires. The energy consumed in transferring data off-chip from memory is hundreds of times larger than the energy the compute units spend on computation. Moreover, while compute units can deliver hundreds of giga- to several tera-operations per second, memory data transfer reaches only tens of gigabytes per second. The way to solve the power and performance problems of data transfer at the same time is to reduce the size of the transferred data. Among algorithms, quantizing the network data to represent it at lower precision is widely used. Binary neural networks (BNNs) reduce precision to the extreme of 1 bit. Network accuracy is lower than at 16-bit precision, but it is steadily improving through various studies. Architecturally, transferred data can also be reused to reduce repeated transfers of the same data. These two methods can be applied without additional computation during inference and are therefore widely used in accelerators. This dissertation proposes a DRAM-based accelerator architecture and a technique that mitigates the performance loss caused by DRAM refresh. The two methods can be integrated into a single DRAM chip and operated independently. The first is a study of a DRAM-based accelerator capable of massive vector dot-product operations. In the field of CNN accelerators to which BNNs can be applied, computing-in-memory (CIM) structures that exploit the memory cell-array structure for vector dot products are being actively studied. In particular, because DRAM holds all the neural network data, it is advantageous for reducing data transfer. We propose a method that computes using the basic operations of DRAM without changing the DRAM cell-array structure. The second is a method that lengthens the DRAM refresh period to reduce performance degradation and power consumption. Whenever DRAM performs a refresh, data cannot be read or written, so the performance of the system or accelerator decreases. We propose a method that tests the refresh characteristics of the DRAM inside the DRAM chip and lengthens the refresh period. Because it operates independently inside the DRAM, it is applicable to any system that uses DRAM, including deep neural network accelerators. In addition, we investigated system integration methods, including the software stack, so that the proposed accelerator can be used easily in widely used deep learning frameworks such as PyTorch. As a result, we expect that the in-DRAM accelerator can be controlled by adding the memory controller verified in the DRAM refresh experiments and a custom compiler to the existing TVM compiler and the FPGA-implemented TVM/VTA accelerator. Furthermore, we added a simulation function to PyTorch so that performance can be predicted at the design stage of the in-DRAM accelerator and the neural network. When a neural network is run in PyTorch, the computation latency and data movement time of the layers executed on the DRAM accelerator can be checked.

Contents:
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Background
  2.1 Neural Network Operation
  2.2 Data Movement Overhead
  2.3 Binary Neural Networks
  2.4 Computing-in-Memory
  2.5 Memory Bottleneck due to Refresh
Chapter 3 In-DRAM Neural Network Accelerator
  3.1 Backgrounds
    3.1.1 DRAM hierarchy
    3.1.2 DRAM Basic Operation
    3.1.3 DRAM Commands with Timing Parameters
    3.1.4 Bit-wise Operation in DRAM
  3.2 Motivations
  3.3 Proposed architecture
    3.3.1 Operation Examples of Row Operator
    3.3.2 Convolutions on DRAM Chip
  3.4 Data Flow
    3.4.1 Input Broadcasting in DRAM
    3.4.2 Input Data Movement With M2V
    3.4.3 Internal Data Movement With SiD
    3.4.4 Data Partitioning for Parallel Operation
  3.5 Experiments
    3.5.1 Performance Estimation
    3.5.2 Configuration of In-DRAM Accelerator
    3.5.3 Improving the Accuracy of BNN
    3.5.4 Comparison with the Existing Works
  3.6 Discussion
    3.6.1 Performance Comparison with ASIC Accelerators
    3.6.2 Challenges of The Proposed Architecture
  3.7 Conclusion
Chapter 4 Reducing DRAM Refresh Power Consumption by Runtime Profiling of Retention Time and Dualrow Activation
  4.1 Introduction
  4.2 Background
  4.3 Related Works
  4.4 Observations
  4.5 Solution overview
  4.6 Runtime profiling
    4.6.1 Basic Operation
    4.6.2 Profiling Multiple Rows in Parallel
    4.6.3 Temperature, Data Backup and Error Check
  4.7 Dual-row Activation
  4.8 Experiments
    4.8.1 Experimental Setup
    4.8.2 Refresh Period Improvement
    4.8.3 Power Reduction
  4.9 Conclusion
Chapter 5 System Integration
  5.1 Integrate The Proposed Methods
  5.2 Software Stack
Chapter 6 Conclusion
Bibliography
Abstract in Korean
Docto
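The reason 1-bit precision suits bit-wise hardware is that a dot product of two vectors with entries in {+1, -1} reduces to an XOR (or XNOR) followed by a bit count. The sketch below illustrates that general BNN identity in Python. It is not the thesis's in-DRAM row-operator mechanism, and the helpers `pack_bits` and `binary_dot` are hypothetical names introduced here.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +/-1 vectors given as packed ints.

    Positions where the signs differ contribute -1, the rest +1,
    so the result is n - 2 * (number of differing bits).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))           # ordinary dot product
result = binary_dot(pack_bits(a), pack_bits(b), len(a))
print(reference, result)
```

Because the whole operation is a bit-wise comparison plus a count, it maps onto wide row-level operations far more cheaply than multi-bit multiply-accumulate does.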

SNU Open Repository and Archive

Extended Field Laser Confocal Microscopy (EFLCM): Combining automated Gigapixel image capture with in silico virtual microscopy

Author: A Czirok
AM Marchevsky
C Conrad
C Sun
CHD Kuglin
Christer Strandh
D Zhang
E Flaberg
E Wang
EA Zamir
Emilie Flaberg
F Demichelis
FJ Leong
FR Dee
G Mehes
G Stuber
GJ Brakenhoff
H Yamamoto
J Bocsi
JC Beck
JG White
JL Lucitti
K Glatz-Krieger
L Markasz
Laszlo Szekely
LMG Brown
M Arrasate
M Iregui
M Kozubek
M Lundin
P Holmvall
P Sabelström
P Saggau
Per Sabelström
PJ Narayan
R Graf
RS Weinstein
SH Lee
SJ Wright
VS Varga
WB Amos
X Ying
X Zhou
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Confocal laser scanning microscopy has revolutionized cell biology. However, the technique has major limitations in speed and sensitivity because a single laser beam scans the sample, allowing only a few microseconds of signal collection for each pixel. This limitation has been overcome by the introduction of parallel beam illumination techniques in combination with cold CCD camera based image capture. Methods: Using the combination of microlens enhanced Nipkow spinning disc confocal illumination together with fully automated image capture and large scale in silico image processing we have developed a system allowing the acquisition, presentation and analysis of maximum resolution confocal panorama images of several Gigapixel size. We call the method Extended Field Laser Confocal Microscopy (EFLCM). Results: We show using the EFLCM technique that it is possible to create a continuous confocal multi-colour mosaic from thousands of individually captured images. EFLCM can digitize and analyze histological slides, sections of entire rodent organs and full size embryos. It can also record hundreds of thousands of cultured cells at multiple wavelengths in single event or time-lapse fashion on fixed slides, in live cell imaging chambers or microtiter plates. Conclusion: The observer independent image capture of EFLCM allows quantitative measurements of fluorescence intensities and morphological parameters on a large number of cells. EFLCM therefore bridges the gap between the mainly illustrative fluorescence microscopy and purely quantitative flow cytometry. EFLCM can also be used as a high content analysis (HCA) instrument for automated screening processes.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Publications from Karolinska Institutet

PubMed Central

Efficient Hardware Implementation of Deep Learning Networks Based on the Convolutional Neural Network

Author: Ansari Anaam
Publication venue: Scholar Commons
Publication date: 07/06/2023

Image classification, speech processing, autonomous driving, and medical diagnosis have made the adoption of Deep Neural Networks (DNNs) mainstream. Many deep networks such as AlexNet, GoogleNet, ResidualNet, MobileNet, YOLOv3 and Transformers have achieved immense success and popularity. However, implementing these deep and complex networks in hardware is a challenging feat. The growing demand for DNN applications in mobile devices and data centers has led researchers to explore application-specific hardware accelerators for DNNs. There have been numerous hardware- and software-based solutions to improve DNN throughput, latency, performance and accuracy. Any solution for hardware acceleration needs to optimize in a space confined by these metrics. Hardware acceleration of DNNs is a highly effective and viable solution for running them on mobile devices. The power of DNNs is now available at the edge in a compact and power-efficient form factor because of hardware acceleration. In this thesis, we introduce a novel architecture that uses a generalized method called Single Input Partial Product 2-Dimensional Convolution (SIPP2D Convolution), which calculates a 2-D convolution in a fast and expedient manner. We present the exploration designs that have culminated in SIPP2D and emphasize its benefits. The SIPP2D architecture prevents the re-fetching of input weights for the calculation of partial products. It can calculate the output of any input size and kernel size with low memory traffic while maintaining low latency and high throughput compared to other popular techniques. In addition to being compatible with any input and kernel size, the SIPP2D architecture can be modified to support any allowable stride. We describe the data flow and algorithmic modifications to SIPP2D which extend its capabilities to accommodate multi-stride convolutions.
Supporting multi-stride convolutions is an essential feature addition to the SIPP2D architecture, increasing its versatility and network-agnostic character for convolutional-type DNNs. Along with architectural explorations, we have also performed research in the area of model optimization. It is widely understood that any change at the algorithmic level of the network pays significant dividends at the hardware level. Compression and optimization techniques such as pruning and quantization help reduce the size of the model while maintaining the accuracy at an acceptable level. Thus, combining techniques such as channel pruning with SIPP2D can only boost its performance. In this thesis, we examine the performance of channel-pruned SIPP2D compared to other compressed models. Traditionally, quantization of weights and inputs is used to reduce memory transfer and power consumption. However, quantizing the outputs of layers can be a challenge since the output of each layer changes with the input. In our research, we use quantization on the output of each layer for AlexNet and VGGNet-16 to analyze the effect it has on accuracy. We use Signal to Noise Quantization Ratio (SQNR) to empirically determine the integer length (IL) as well as the fractional length (FL) for the fixed point precision that can yield the lowest SQNR and highest accuracy. Based on our observations, we can report that accuracy is sensitive to fractional length as well as integer length. For AlexNet, we observe deterioration in accuracy as the word length (WL) decreases. The Top-5 accuracy goes from 77% for floating point precision to 56% for a WL of 12 and FL of 8. The results are similar in the case of VGGNet-16. The Top-5 accuracy for VGGNet-16 decreases from 82% for floating point to 30% for a WL of 12 and FL of 8. In addition to the small word length, we observe the accuracy to be highly dependent on the integer length as well as the fractional length.
We have also analyzed the loss that remains after retraining post quantization. We use polynomial fitting to derive a relationship between fractional length and the drop in accuracy still sustained after retraining a quantized network. In summary, the winning combination of the enhanced SIPP2D architecture and compression techniques such as channel pruning and quantization is highly advantageous and conducive to widespread adoption. The SIPP2D architecture, with its flexible data flow and algorithmic modifications to support multi-stride convolutions, offers a powerful and versatile framework for deep neural networks.
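To make the word-length and fractional-length discussion concrete, the sketch below quantizes values to a signed fixed-point grid and measures the resulting SQNR in Python. It assumes that the word length counts the sign bit and that integer length equals word length minus fractional length. The helper names and the synthetic "layer outputs" are illustrative assumptions, not the thesis's code or data.

```python
import math

def quantize_fixed(x, word_length, frac_length):
    """Round x to a signed fixed-point value with saturation.

    word_length counts all bits including sign; frac_length is the
    number of bits after the binary point (assumed convention).
    """
    scale = 2 ** frac_length
    q_min = -(2 ** (word_length - 1))
    q_max = 2 ** (word_length - 1) - 1
    q = max(q_min, min(q_max, round(x * scale)))
    return q / scale

def sqnr_db(signal, quantized):
    """Ratio of signal power to quantization-noise power, in dB."""
    p_signal = sum(s * s for s in signal)
    p_noise = sum((s - q) ** 2 for s, q in zip(signal, quantized))
    return float("inf") if p_noise == 0 else 10 * math.log10(p_signal / p_noise)

# Synthetic stand-in for layer outputs (illustrative only).
values = [5.0 * math.sin(0.37 * k) for k in range(200)]

sqnr_fl8 = sqnr_db(values, [quantize_fixed(v, 12, 8) for v in values])
sqnr_fl4 = sqnr_db(values, [quantize_fixed(v, 12, 4) for v in values])
print(round(sqnr_fl8, 1), round(sqnr_fl4, 1))
```

With the word length fixed, moving bits from the fractional part to the integer part widens the representable range but coarsens the step size. That trade-off is why accuracy can be sensitive to both lengths.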

Scholar Commons - Santa Clara University

Exploration of Activation Fault Reliability in Quantized Systolic Array-Based DNN Accelerators

Author: Ansari Mohammad Saeed
Cherezova Natalia
Daneshtalab Masoud
Jenihhin Maksim
Mahani Ali
Raik Jaan
Taheri Mahdi
Publication date: 17/01/2024

The stringent requirements for the reliability of Deep Neural Network (DNN) accelerators stand alongside the need to reduce the computational burden on hardware platforms, i.e., reducing energy consumption and execution time as well as increasing the efficiency of DNN accelerators. Moreover, the growing demand for specialized DNN accelerators with tailored requirements, particularly for safety-critical applications, necessitates a comprehensive design space exploration to enable the development of efficient and robust accelerators that meet those requirements. Therefore, the trade-off between hardware performance, i.e., area and delay, and the reliability of the DNN accelerator implementation becomes critical and requires tools for analysis. This paper presents a comprehensive methodology for exploring and enabling a holistic assessment of the trilateral impact of quantization on model accuracy, activation fault reliability, and hardware efficiency. A fully automated framework is introduced that is capable of applying various quantization-aware techniques, fault injection, and hardware implementation, thus enabling the measurement of hardware parameters. Moreover, this paper proposes a novel lightweight protection technique integrated within the framework to ensure the dependable deployment of the final systolic-array-based FPGA implementation. The experiments on established benchmarks demonstrate the analysis flow and the profound implications of quantization on reliability, hardware performance, and network accuracy, particularly concerning the transient faults in the network's activations.

arXiv.org e-Print Archive