8 research outputs found

    Artificial Neural Network Models for Pattern Discovery from ECG Time Series

    Get PDF
    Artificial Neural Network (ANN) models have recently become the de facto models for deep learning, with applications spanning scientific fields such as computer vision, physics, biology, and medicine, as well as everyday life (recommending movies, shopping lists, etc.). Due to advancements in computer technology and the increased use of Artificial Intelligence (AI) in medicine and biological research, ANNs have been extensively applied not only to provide quick information about diseases, but also to make diagnostics accurate and cost-effective. We propose an ANN-based model that analyzes a patient's electrocardiogram (ECG) data and produces accurate diagnostics regarding possible heart diseases (arrhythmia, myocardial infarction, etc.). Our model is mainly characterized by its simplicity, as it does not require significant computational power to produce results. We create and test our model using the MIT-BIH and PTB diagnostic datasets, which are real ECG time series from thousands of patients.
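    As an illustration only (not the paper's model), the sketch below trains a small feed-forward classifier on fixed-length ECG beat windows; the window length, the 5-class label space, and all hyperparameters are assumptions, and real MIT-BIH/PTB loading is omitted.

```python
# Illustrative sketch only, NOT the paper's model: a small feed-forward classifier
# for fixed-length ECG beat windows. Shapes, the 5-class label space, and all
# hyperparameters are assumptions; real MIT-BIH/PTB loading is omitted.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 187))       # placeholder beat windows (187 samples each)
y = rng.integers(0, 5, size=1000)      # placeholder labels for 5 hypothetical beat classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Two small hidden layers keep the compute budget modest, in line with the
# "simplicity" emphasis above; on random data the score is near chance, of course.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=100, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```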

    Accelerating Sensitivity Analysis in Microscopy Image Segmentation Workflows

    Get PDF
    With the increasing availability of digital microscopy imaging equipment, there is a demand for efficient execution of whole-slide tissue image applications. Through sensitivity analysis it is possible to improve the output quality of such applications and, thus, the quality of the desired analysis. Due to the high computational cost of these analyses and the recurrent nature of the tasks executed by sensitivity analysis methods (i.e., re-execution of tasks), an opportunity for computation reuse arises. By performing computation reuse we can reduce the run time of sensitivity analysis applications. This work therefore focuses on finding new ways to take advantage of computation reuse opportunities at multiple task abstraction levels. This is done by presenting a coarse-grain merging strategy and new fine-grain merging algorithms, implemented on top of the Region Templates Framework.
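    As a rough illustration of the basic reuse opportunity (not the coarse/fine-grain merging algorithms or the Region Templates Framework itself), the sketch below memoizes a hypothetical per-tile segmentation stage so that parameter settings repeated across a sensitivity-analysis sweep reuse cached results instead of re-executing the task.

```python
# Illustrative sketch only, not the coarse/fine-grain merging algorithms of the
# Region Templates Framework: memoize a hypothetical per-tile segmentation stage so
# that parameter settings repeated across a sensitivity-analysis sweep reuse cached
# results instead of re-executing the task.
import functools

@functools.lru_cache(maxsize=None)
def segment_tile(tile_id: int, threshold: float, kernel_size: int) -> int:
    # Stand-in for an expensive whole-slide tile segmentation computation.
    return hash((tile_id, threshold, kernel_size)) % 1000

def sensitivity_sweep(tile_ids, thresholds, kernel_sizes):
    # Sensitivity analysis re-runs the same tasks under many parameter combinations;
    # identical (tile, parameters) invocations hit the cache.
    return {
        (t, k): [segment_tile(i, t, k) for i in tile_ids]
        for t in thresholds
        for k in kernel_sizes
    }

sensitivity_sweep(range(16), thresholds=[0.4, 0.5, 0.4], kernel_sizes=[3, 5])
print(segment_tile.cache_info())   # hits > 0 because threshold 0.4 is revisited
```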

    Optimización de redes neuronales para ejecución eficiente de aplicaciones de inteligencia artificial en GPUs embebidos (Optimization of neural networks for efficient execution of artificial intelligence applications on embedded GPUs)

    Get PDF
    Graduation Project (Master's in Electronics), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería Electrónica, 2021. The field of artificial intelligence has developed rapidly in recent years, with advances that have even led to the replacement of classical algorithms for certain specific problems. As neural networks evolve, they can become computationally intensive and require large amounts of resources that are not always available. To deploy trained models efficiently in real-life applications, they must be optimized for the target architectures. To study neural network optimizations on embedded systems, a detection model for official Costa Rican vehicle license plates is designed and optimized for efficient execution on a mobile GPU. The optimizations applied in this project are quantization and pruning, applied across different frameworks in order to observe their effects under various configurations.
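    A minimal, framework-agnostic sketch of the two optimizations named above follows: symmetric 8-bit post-training quantization and global magnitude pruning of a weight matrix. A real deployment on an embedded GPU would rely on the target framework's tooling; this only shows the underlying arithmetic.

```python
# Illustrative, framework-agnostic sketch of the two optimizations named above:
# symmetric 8-bit post-training quantization and global magnitude pruning of a
# weight matrix. A real deployment on an embedded GPU would rely on the target
# framework's tooling; this only shows the underlying arithmetic.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0               # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                               # dequantize as q * scale

def prune_by_magnitude(w: np.ndarray, sparsity: float) -> np.ndarray:
    k = int(w.size * sparsity)                    # number of weights to zero out
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0     # drop the smallest-magnitude weights
    return pruned

w = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
w_sparse = prune_by_magnitude(w, sparsity=0.5)
print("max quantization error:", float(np.abs(q.astype(np.float32) * s - w).max()))
print("achieved sparsity:", float((w_sparse == 0).mean()))
```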

    Acceleration of CNN Computation on a PIM-enabled GPU system

    Get PDF
    Doctoral dissertation -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2022 (advisor: 이혁재). Recently, convolutional neural networks (CNNs) have been widely used in image processing and computer vision. CNNs are composed of various layers, such as the computation-intensive convolutional layer and the memory-intensive fully connected, batch normalization, and activation layers. GPUs are often used to accelerate CNNs, but performance is limited by the high computational cost and memory usage of the convolution. In addition, the increasing demand for high-resolution image applications increases the burden of data movement between the GPU and memory. By performing computations in memory, processing-in-memory (PIM) is expected to mitigate the overhead caused by data transfer, so a system that combines a host GPU with PIM is promising for processing CNNs. First, prior studies exploited approximate computing to reduce the computational cost. However, they only reduced the amount of computation, so performance remains bottlenecked by memory bandwidth due to the increased memory intensity. In addition, the load imbalance between warps caused by approximation also inhibits performance improvement.
    This dissertation proposes a PIM solution that reduces the amount of data movement and computation through Approximate Data Comparison in PIM (ADC-PIM). Instead of determining value similarity on the GPU, the ADC-PIM, located in memory, compares the similarity and transfers only the selected data to the GPU. The GPU performs convolution on the representative data transferred from the ADC-PIM and reuses the calculated results based on the similarity information. To curb the increase in memory latency due to the data comparison, a two-level PIM architecture that exploits both the DRAM bank and the TSV stage is proposed. To ease load balancing on the GPU, the ADC-PIM reorganizes data by assigning the representative data to proper addresses computed from the comparison results. Second, to relieve the memory bottleneck caused by the high memory usage of non-convolutional layers, these layers are accelerated with PIM. Previous studies also accelerated non-convolutional layers with PIM, but the performance improvement was limited because they simply assumed that the GPU and PIM operate sequentially. The proposed method accelerates CNN training with a pipelined execution of the GPU and PIM, exploiting the fact that the non-convolution operations are performed in units of channels of the output feature map: the PIM performs non-convolutional operations on the output feature map channels for which the GPU has completed the convolution. To balance the convolution and non-convolution jobs in the weight-update and feature-map-gradient computations of the back-propagation pass, the non-convolution work is distributed appropriately between them. In addition, a memory scheduling algorithm based on bank ownership between the host and the PIM is proposed to minimize the overall execution time when the host and the PIM access memory simultaneously. Finally, a GPU-based PIM architecture for image processing applications is proposed. A programmable GPU-based PIM is attractive because it enables the use of well-crafted software development kits (SDKs) such as CUDA and OpenCL. However, the large on-chip SRAM capacity of a GPU makes it difficult to fit a sufficient number of computing units on the logic die. This dissertation therefore proposes a lightweight GPU-based PIM architecture and well-matched optimization strategies that consider both the characteristics of image applications and the logic-die constraints. Data is allocated to each computing unit so that the data locality and access patterns of image processing applications are preserved. By adding a prefetcher that leverages this pattern-aware data allocation, the number of active warps and the on-chip SRAM size of the PIM are significantly reduced.
    This enables the logic die constraints to be satisfied and a greater number of computing units to be integrated on the logic die.
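    To make the approximate-data-comparison idea concrete, here is a host-side NumPy sketch (no real PIM, and not the dissertation's architecture): near-duplicate input tiles are grouped, the expensive operation runs only on one representative per group, and its result is reused for the group's other members. The similarity test and tolerance are assumptions.

```python
# Host-side NumPy sketch of the approximate-data-comparison idea, NOT the ADC-PIM
# hardware: near-duplicate input tiles are grouped, the expensive operation is run
# only on one representative per group, and its result is reused for the group's
# other members. The similarity test and tolerance are assumptions.
import numpy as np

def group_similar(tiles: np.ndarray, tol: float):
    reps, owner = [], np.empty(len(tiles), dtype=int)
    for i, tile in enumerate(tiles):
        for g, r in enumerate(reps):
            if np.max(np.abs(tile - tiles[r])) <= tol:   # "similar enough" test
                owner[i] = g
                break
        else:
            owner[i] = len(reps)                         # new representative
            reps.append(i)
    return np.array(reps), owner

def approx_filter(tiles: np.ndarray, kernel: np.ndarray, tol: float = 0.05):
    reps, owner = group_similar(tiles, tol)
    rep_out = np.array([np.sum(tiles[r] * kernel) for r in reps])  # compute reps only
    return rep_out[owner], len(reps)                               # reuse per group

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 3, 3))
tiles = np.repeat(base, 4, axis=0) + rng.normal(scale=0.005, size=(32, 3, 3))
kernel = rng.normal(size=(3, 3))
approx, n_groups = approx_filter(tiles, kernel)
exact = np.array([np.sum(t * kernel) for t in tiles])
print("representatives computed:", n_groups, "of", len(tiles))
print("max error vs exact:", float(np.abs(approx - exact).max()))
```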

    Low-power accelerators for cognitive computing

    Get PDF
    Deep Neural Networks (DNNs) have achieved tremendous success for cognitive applications, and are especially effective in classification and decision-making problems such as speech recognition or machine translation. Mobile and embedded devices increasingly rely on DNNs to understand the world. Smartphones, smartwatches and cars perform discriminative tasks, such as face or object recognition, on a daily basis. Despite the increasing popularity of DNNs, running them on mobile and embedded systems comes with a major challenge: delivering high accuracy and performance within a small memory and energy budget. Modern DNN models consist of billions of parameters requiring huge computational and memory resources and, hence, they cannot be directly deployed on low-power systems with limited resources. The objective of this thesis is to address these issues and propose novel solutions in order to design highly efficient custom accelerators for DNN-based cognitive computing systems. First, we focus on optimizing the inference of DNNs for sequence processing applications. We perform an analysis of the input similarity between consecutive DNN executions. Then, based on the high degree of input similarity, we propose DISC, a hardware accelerator implementing a Differential Input Similarity Computation technique to reuse the computations of the previous execution, instead of computing the entire DNN. We observe that, on average, more than 60% of the inputs of any neural network layer tested exhibit negligible changes with respect to the previous execution. Avoiding the memory accesses and computations for these inputs results in 63% energy savings on average. Second, we propose to further optimize the inference of FC-based DNNs. We first analyze the number of unique weights per input neuron of several DNNs. Exploiting common optimizations, such as linear quantization, we observe a very small number of unique weights per input for several FC layers of modern DNNs. Then, to improve the energy efficiency of FC computation, we present CREW, a hardware accelerator that implements a Computation Reuse and an Efficient Weight Storage mechanism to exploit the large number of repeated weights in FC layers. CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage. We evaluate CREW on a diverse set of modern DNNs. On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator. Third, we propose a mechanism to optimize the inference of RNNs. RNN cells perform element-wise multiplications across the activations of different gates, with sigmoid and tanh being the common activation functions. We perform an analysis of the activation function values, and show that a significant fraction are saturated towards zero or one in popular RNNs. Then, we propose CGPA to dynamically prune activations from RNNs at a coarse granularity. CGPA avoids the evaluation of entire neurons whenever the outputs of peer neurons are saturated. CGPA significantly reduces the amount of computations and memory accesses while avoiding sparsity to a large extent, and can be easily implemented on top of conventional accelerators such as the TPU with negligible area overhead, resulting in 12% speedup and 12% energy savings on average for a set of widely used RNNs. Finally, the last contribution of this thesis focuses on static DNN pruning methodologies.
    DNN pruning reduces memory footprint and computational work by removing connections and/or neurons that are ineffectual. However, we show that prior pruning schemes require an extremely time-consuming iterative process that retrains the DNN many times to tune the pruning parameters. We therefore propose a DNN pruning scheme based on Principal Component Analysis and the relative importance of each neuron's connections that automatically finds the optimized DNN in one shot, without manually tuning multiple parameters.
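    As a software-level illustration of the computation-reuse idea behind CREW (not the accelerator itself), the sketch below exploits the small number of unique quantized weights per input neuron in an FC layer: each unique product is computed once and then reused (gathered) across output neurons.

```python
# Software-level sketch of the computation-reuse idea behind CREW, NOT the
# accelerator: after linear quantization each input neuron sees few unique weight
# values, so the product input*weight is computed once per unique value and the
# result is reused (gathered) across output neurons.
import numpy as np

def fc_with_weight_reuse(x: np.ndarray, Wq: np.ndarray):
    # x: (n_in,) activations; Wq: (n_in, n_out) linearly quantized weights.
    n_in, n_out = Wq.shape
    y = np.zeros(n_out)
    multiplies = 0
    for i in range(n_in):
        uniq, inv = np.unique(Wq[i], return_inverse=True)
        partial = x[i] * uniq            # one multiply per *unique* weight value
        multiplies += len(uniq)
        y += partial[inv]                # a gather replaces the remaining multiplies
    return y, multiplies

rng = np.random.default_rng(0)
x = rng.normal(size=128)
Wq = rng.integers(-8, 8, size=(128, 256)).astype(np.float64)   # 4-bit-style weights
y_reuse, multiplies = fc_with_weight_reuse(x, Wq)
print("multiplies:", multiplies, "vs dense baseline:", Wq.size)
print("matches dense matmul:", bool(np.allclose(y_reuse, x @ Wq)))
```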

    Design and Implementation of Hardware Accelerators for Neural Processing Applications

    Full text link
    The primary motivation for this work was the need to implement hardware accelerators for a newly proposed ANN structure called the Auto Resonance Network (ARN) for robotic motion planning. ARN is an approximating, feed-forward, hierarchical and explainable network. It can be used in various AI applications, but its application base was small. Therefore, the objective of the research was twofold: to develop a new application using ARN and to implement a hardware accelerator for ARN. As per the suggestions of the Doctoral Committee, an image recognition system using ARN has been implemented. An accuracy of around 94% was achieved with only 2 layers of ARN. The network also required a small training set of about 500 images. The publicly available MNIST dataset was used for this experiment. All the coding was done in Python. The massive parallelism seen in ANNs presents several challenges to CPU design. For a given functionality, e.g., multiplication, several copies of a serial module can be realized within the same area as one parallel module. The advantage of using serial modules rather than parallel modules under area constraints has been discussed. One module often useful in ANNs is multi-operand addition. One problem in its implementation is estimating the number of carry bits when the number of operands changes. A theorem to calculate the exact number of carry bits required for a multi-operand addition is presented in the thesis, which alleviates this problem. The main advantage of the modular approach to multi-operand addition is the possibility of pipelined addition with low reconfiguration overhead. This results in an overall increase in throughput for the large numbers of additions typically seen in several DNN configurations.
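    For context on the multi-operand addition sizing issue, here is a small check of a standard bound (stated as an assumption, since the thesis' exact theorem is not reproduced here): the sum of k unsigned n-bit operands fits in n + ceil(log2 k) bits, because the worst case is k * (2^n - 1).

```python
# A standard sizing bound, stated as an assumption rather than the thesis' exact
# theorem: the sum of k unsigned n-bit operands fits in n + ceil(log2 k) bits, since
# the worst case is k * (2**n - 1). The brute-force check below evaluates the exact
# number of extra carry bits for small k and n and verifies it never exceeds the bound.
import math

def extra_carry_bits(k: int, n: int) -> int:
    # Bits beyond n needed to hold the worst-case sum of k unsigned n-bit operands.
    worst_case_sum = k * (2**n - 1)
    return max(0, worst_case_sum.bit_length() - n)

for n in (4, 8):
    for k in (2, 3, 4, 7, 8, 16):
        exact = extra_carry_bits(k, n)
        assert exact <= math.ceil(math.log2(k))
        print(f"{k:2d} operands of {n} bits -> {exact} extra carry bits")
```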

    Power-Efficient Accelerator Design for Neural Networks Using Computation Reuse

    No full text