    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphics processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's Open Computing Language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas: partial differential equations, graph cluster metric calculations, and random number generation. We report on programming experiences and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs, a CellBE, and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
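
    The pattern of one CPU thread controlling each independent device can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: `run_on_device` is a hypothetical stand-in for a real kernel launch (it computes a sum of squares on the CPU), and `split` and `hybrid_sum_of_squares` are names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_device(device_id, chunk):
    # Hypothetical stand-in for a GPU kernel launch bound to `device_id`;
    # here the work is simply done on the CPU.
    return device_id, sum(x * x for x in chunk)


def split(data, parts):
    # Split `data` into `parts` contiguous chunks of nearly equal size.
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def hybrid_sum_of_squares(data, num_devices=2):
    # Coarse-grained parallelism: one controlling thread per device.
    chunks = split(data, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [pool.submit(run_on_device, d, c) for d, c in enumerate(chunks)]
        partials = [f.result()[1] for f in futures]
    # Combine the per-device partial results on the host.
    return sum(partials)
```

    In a real hybrid code each thread would select its device, copy its chunk across, launch kernels, and return the partial result; the host-side structure stays the same.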

    GPU optimizations for a production molecular docking code

    Thesis (M.Sc.Eng.) -- Boston University. Scientists have always felt the desire to perform computationally intensive tasks that surpass the capabilities of conventional single core computers. As a result of this trend, Graphics Processing Units (GPUs) have come to be increasingly used for general computation in scientific research. This field of GPU acceleration is now a vast and mature discipline. Molecular docking, the modeling of the interactions between two molecules, is a particularly computationally intensive task that has been the subject of research for many years. It is a critical simulation tool used for the screening of protein compounds for drug design and in research of the nature of life itself. The PIPER molecular docking program was previously accelerated using GPUs, achieving a notable speedup over conventional single core implementation. Since its original release the development of the CPU based PIPER has not ceased, and it is now a mature and fast parallel code. The GPU version, however, still contains many potential points for optimization. In the current work, we present a new version of GPU PIPER that attains a 3.3x speedup over a parallel MPI version of PIPER running on an 8 core machine and using the optimized Intel Math Kernel Library. We achieve this speedup by optimizing existing kernels for modern GPU architectures and migrating critical code segments to the GPU. In particular, we both improve the runtime of the filtering and scoring stages by more than an order of magnitude, and move all molecular data permanently to the GPU to improve data locality. This new speedup is obtained while retaining a computational accuracy virtually identical to the CPU based version. We also demonstrate that, due to the algorithmic dependencies of the PIPER algorithm on the 3D Fast Fourier Transform, our GPU PIPER will likely remain proportionally faster than equivalent CPU based implementations, and with little room for further optimizations. This new GPU accelerated version of PIPER is integrated as part of the ClusPro molecular docking and analysis server at Boston University. ClusPro has over 4000 registered users and more than 50000 jobs run over the past 4 years.

    A performance focused, development friendly and model aided parallelization strategy for scientific applications

    The advancement of high performance computing platforms has provided unprecedented computing power with the evolution of multi-core CPUs, massively parallel architectures such as General Purpose Graphics Processing Units (GPGPUs) and Many Integrated Core (MIC) architectures such as Intel's Xeon Phi coprocessor. However, it is a great challenge to leverage the capabilities of such advanced supercomputing hardware, as it requires efficient and effective parallelization of scientific applications. This task is difficult mainly due to the complexity of scientific algorithms coupled with the variety of available hardware and disparate programming models. To address these challenges, this thesis presents a parallelization strategy to accelerate scientific applications that maximizes the opportunities for achieving speedup while minimizing the development effort. Parallelization is a three-step process: (1) choose a compatible combination of architecture and parallel programming language, (2) translate the base code/algorithm to a parallel language and (3) optimize and tune the application. In this research, a quantitative comparison of run time for various implementations of the k-means algorithm is used to establish that native languages (OpenMP, MPI, CUDA) perform better on their respective architectures than vendor-neutral languages such as OpenCL. A qualitative model is used to select an optimal architecture for a given application by aligning the capabilities of accelerators with the characteristics of the application. Once the optimal architecture is chosen, the corresponding native language is employed. This approach provides the best performance with reasonable accuracy (78%) of predicting a fitting combination, while eliminating the need for exploring different architectures individually. It reduces the required development effort considerably as the application need not be re-written in multiple languages. The focus can be solely on optimization and tuning to achieve the best performance on available architectures with minimized investment in terms of cost and effort. To verify the prediction accuracy of the qualitative model, the OpenDwarfs benchmark suite, which implements Berkeley's dwarfs in OpenCL, is used. A dwarf is an algorithmic method that captures a pattern of computation and communication. For the purpose of this research, the focus is on nine applications from various algorithmic domains that cover the seven dwarfs of symbolic computation, which were identified by Phillip Colella as omnipresent in scientific and engineering applications. To validate the parallelization strategy collectively, a case study is undertaken. This case study involves parallelization of the Lower Upper Decomposition for the Gaussian Elimination algorithm from the linear algebra domain, using conventional trial and error methods as well as the proposed 'Architecture First, Language Later' strategy. The development effort incurred is contrasted for both methods. The proposed strategy is observed to reduce the development effort by an average of 50%.
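
    For readers unfamiliar with the case-study kernel, a serial baseline of LU decomposition might look like the sketch below. It assumes the Doolittle variant without pivoting, which the abstract does not specify, and `lu_decompose` is a name chosen for this example rather than taken from the thesis.

```python
def lu_decompose(a):
    """Doolittle LU decomposition without pivoting: a = lower * upper."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lower[i][i] = 1.0  # unit diagonal in the lower factor
        # Row i of the upper factor.
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        if upper[i][i] == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        # Column i of the lower factor.
        for j in range(i + 1, n):
            partial = sum(lower[j][k] * upper[k][i] for k in range(i))
            lower[j][i] = (a[j][i] - partial) / upper[i][i]
    return lower, upper
```

    The outer loop over `i` is inherently sequential, while the inner loops over `j` are independent of one another for a fixed `i`; that inner work is what a parallel port would typically distribute.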

    FPGA Acceleration of 3GPP Channel Model Emulator for 5G New Radio

    The channel model is by far the most compute-intensive part of the link level simulations of multiple-input and multiple-output (MIMO) fifth-generation new radio (5G NR) communication systems. Simulation effort further increases when using more realistic geometry-based channel models, such as the three-dimensional spatial channel model (3D-SCM). Channel emulation is used for functional and performance verification of such models in the network planning phase. These models use multiple finite impulse response (FIR) filters and have a very high degree of parallelism which can be exploited for accelerated execution on Field Programmable Gate Array (FPGA) and Graphics Processing Unit (GPU) platforms. This paper proposes an efficient re-configurable implementation of the 3rd generation partnership project (3GPP) 3D-SCM on FPGAs using a design flow based on high-level synthesis (HLS). It studies the effect of various HLS optimization techniques on the total latency and hardware resource utilization on Xilinx Alveo U280 and Intel Arria 10GX 1150 high-performance FPGAs, using in both cases the commercial HLS tools of the vendor. The channel model accuracy is preserved using double precision floating point arithmetic. This work analyzes in detail the effort to target the FPGA platforms using HLS tools, both in terms of common parallelization effort (shared by both FPGAs), and in terms of platform-specific effort, different for Xilinx and Intel FPGAs. Compared to the baseline general-purpose central processing unit (CPU) implementation, the achieved speedups are 65X and 95X using the Xilinx UltraScale+ and Intel Arria FPGA platforms, respectively, when using a Double Data Rate (DDR) memory interface. The FPGA-based designs also achieved ~3X better performance compared to a similar technology node NVIDIA GeForce GTX 1070 GPU, while consuming ~4X less energy. The FPGA implementation speedup improves up to 173X over the CPU baseline when using the Xilinx UltraRAM (URAM) and High-Bandwidth Memory (HBM) resources, also achieving 6X lower latency and 12X lower energy consumption than the GPU implementation.
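
    The FIR filters mentioned above are the source of the parallelism. A direct-form FIR filter can be sketched as below; this is a generic textbook formulation, not the paper's HLS code, and `fir_filter` is a name chosen for this example.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += h * samples[n - k]
        out.append(acc)
    return out
```

    Each output sample depends only on the inputs and the taps, never on other outputs, so the iterations of the outer loop can be unrolled or pipelined in hardware.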

    FPGA Based Acceleration of Matrix Decomposition and Clustering Algorithm Using High Level Synthesis

    FPGAs have shown great promise for accelerating computationally intensive algorithms. However, FPGA-based accelerator design is tedious and time consuming if we rely on traditional HDL-based design methods. The recent introduction of the Altera SDK for OpenCL (AOCL) high level synthesis tool enables developers to utilize the FPGA's potential without long development time and extensive hardware knowledge. AOCL is used in this thesis to accelerate computationally intensive algorithms in the field of machine learning and scientific computing. The algorithms studied are k-means clustering, k-nearest neighbor search, N-body simulation and LU decomposition. The performance and power consumption of the algorithms synthesized using AOCL for FPGA are evaluated against state of the art CPU and GPU implementations. The k-means clustering and k-nearest neighbor kernels designed for FPGA significantly outperformed optimized CPU implementations while achieving similar or better power efficiency than that of the GPU.
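
    As a reference point for the k-means kernel, one iteration of the standard Lloyd algorithm is sketched below. It assumes squared Euclidean distance and keeps an empty cluster's centroid unchanged; both are choices made for this example, and `kmeans_step` is not a name from the thesis.

```python
def kmeans_step(points, centroids):
    """One Lloyd iteration: assign points to nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    # Assignment step: independent per point, hence data-parallel.
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        clusters[dists.index(min(dists))].append(p)
    # Update step: each centroid becomes the mean of its members.
    updated = []
    for c, members in zip(centroids, clusters):
        if members:
            count = len(members)
            updated.append(tuple(sum(m[d] for m in members) / count for d in range(len(c))))
        else:
            updated.append(tuple(c))  # assumption: keep an empty cluster's centroid
    return updated
```

    Repeating `kmeans_step` until the centroids stop moving gives the full algorithm; the assignment step dominates the run time and is the part an accelerator would target.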

    Computing Large-scale Distance Matrices on GPU

    A distance matrix is an n×n two-dimensional array that contains the pairwise distances of a set of n points in a metric space. It is widely used in several fields of scientific research, e.g., data clustering, machine learning, pattern recognition, image analysis, information retrieval, signal processing, and bioinformatics. However, as n increases, the computation of the distance matrix becomes very slow or infeasible on traditional general purpose computers. In this paper, we propose an inexpensive and scalable data-parallel solution to this problem by dividing the computational tasks and data on GPUs. We demonstrate the performance of our method on a set of real-world biological networks constructed from a renowned breast cancer study.
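
    The idea of dividing the computation into independent pieces can be illustrated with a tiled CPU sketch. It assumes Euclidean distance and a square tile size, neither of which is stated in the abstract, and the function names are invented for this example.

```python
import math


def distance_block(points, rows, cols):
    # Distances for one tile; a tile needs only its own rows and columns of data.
    return [[math.dist(points[i], points[j]) for j in cols] for i in rows]


def distance_matrix_tiled(points, tile=2):
    """Build the n x n distance matrix tile by tile."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for r0 in range(0, n, tile):
        rows = range(r0, min(r0 + tile, n))
        for c0 in range(0, n, tile):
            cols = range(c0, min(c0 + tile, n))
            block = distance_block(points, rows, cols)
            # Copy the finished tile into its place in the full matrix.
            for bi, i in enumerate(rows):
                for bj, j in enumerate(cols):
                    matrix[i][j] = block[bi][bj]
    return matrix
```

    Because tiles do not depend on one another, each could be handed to a separate device or thread block, and only the tile's slice of the input has to be resident in that device's memory.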

    Neural networks using for handwriting numbers recognition

    In the presented work, a Hopfield neural network was constructed for recognizing handwritten digit patterns contained in the MNIST database. Ten Hopfield neural networks were built for each digit separately. The centers of clusters that were built using the Kohonen neural network were taken as objects for "memorization". Two methods were proposed that serve as a supporting step in the Hopfield neural network, and an analysis of these methods was carried out. An error was also calculated for each method, and the pros and cons of their use were identified. Handwritten digits from the training sample of the MNIST database are clustered using a Kohonen neural network. The optimal number of clusters (not exceeding 50) is selected for each digit. The Euclidean norm is used as the metric for the Kohonen network. The network is trained by a serial algorithm on the CPU and by a parallel algorithm on the GPU using CUDA technology. Graphs of the time spent on training the neural network for each digit are given, and the time spent on serial and parallel training is compared. It is found that training the neural network with CUDA technology is, on average, almost 17 times faster. The digits from the test sample of the MNIST database are used to evaluate the accuracy of the constructed clusters. It is found that the percentage of vectors from the test sample in the correct cluster for each digit is more than 90%. The F-measure for each digit is calculated. The best values of the F-measure are obtained for 0 and 1 (F-measure is 0.974), whereas the worst values are obtained for the digit 9 (F-measure is 0.903). The introduction briefly describes the content of the work, what research is currently available, and the relevance of this work. This is followed by a statement of the problem and a description of the technologies used in this work. The first chapter describes the theoretical aspects and how each stage of the work is solved. The second chapter contains a description of the program and the results obtained. In the second chapter, we discuss parallelizing the learning algorithm of the Kohonen neural network. In the third chapter, the software is tested. The results are the recognition response of each neural network (the stored image most similar to the image submitted as input) and the total recognition percentage for each neural network.
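
    The F-measure reported above is commonly defined as the harmonic mean of precision and recall. The sketch below assumes that balanced (F1) form, which the abstract does not state explicitly; `f_measure` and its count arguments are names chosen for this example.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1) from raw counts for one digit class."""
    if true_pos == 0:
        return 0.0  # no correct hits: precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

    For a given digit, `true_pos` would count test vectors of that digit placed in its cluster, `false_pos` other digits placed there, and `false_neg` vectors of that digit placed elsewhere.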

    FPGA-Based Acceleration of the Self-Organizing Map (SOM) Algorithm using High-Level Synthesis

    One of the fastest growing and most demanding areas of computer science is Machine Learning (ML). The Self-Organizing Map (SOM), categorized as unsupervised ML, is a popular data-mining algorithm widely used in Artificial Neural Networks (ANN) for mapping high dimensional data into low dimensional feature maps. SOM, being computationally intensive, requires high computational time and power when dealing with large datasets. Acceleration of many computationally intensive algorithms can be achieved using Field-Programmable Gate Arrays (FPGAs), but it requires extensive hardware knowledge and longer development time when employing traditional Hardware Description Language (HDL) based design methodology. Open Computing Language (OpenCL) is a standard framework for writing parallel computing programs that execute on heterogeneous computing systems. Intel FPGA Software Development Kit for OpenCL (IFSO) is a High-Level Synthesis (HLS) tool that provides a more efficient alternative to HDL-based design. This research presents an optimized OpenCL implementation of the SOM algorithm on Stratix V and Arria 10 FPGAs using IFSO. Compared to recent SOM implementations on Central Processing Unit (CPU) and Graphics Processing Unit (GPU), our OpenCL implementation on FPGAs provides superior speed performance and power consumption results. Stratix V achieves a speedup of 1.41x - 16.55x compared to AMD and Intel CPUs and 2.18x compared to an Nvidia GPU, whereas Arria 10 achieves a speedup of 1.63x - 19.15x compared to AMD and Intel CPUs and 2.52x compared to an Nvidia GPU. In terms of power consumption, Stratix V is 35.53x and 42.53x, whereas Arria 10 is 15.82x and 15.93x, more power efficient compared to CPU and GPU respectively.
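
    One SOM training step can be sketched as below. The sketch assumes a one-dimensional map, Euclidean distance, and a Gaussian neighborhood function; these are common textbook choices rather than details from the paper, and the function names are invented for this example.

```python
import math


def best_matching_unit(weights, x):
    # Index of the neuron whose weight vector is closest to the input x.
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def som_update(weights, x, learning_rate=0.5, radius=1.0):
    """Move every neuron toward x, scaled by its map distance to the BMU."""
    bmu = best_matching_unit(weights, x)
    updated = []
    for i, w in enumerate(weights):
        # Gaussian neighborhood: 1.0 at the BMU, decaying with map distance.
        influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
        updated.append([wi + learning_rate * influence * (xi - wi) for wi, xi in zip(w, x)])
    return bmu, updated
```

    The distance search and the weight update are both independent per neuron, which is why the algorithm maps well onto parallel hardware.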

    Optimización en GPU de algoritmos para la mejora del realce y segmentación en imágenes hepáticas

    This doctoral thesis investigates GPU acceleration for liver image enhancement and segmentation. The research is presented as a compendium of articles and is structured in three scientific contributions: the first addresses enhancement and tumor segmentation, the second explores vessel segmentation, and the third covers liver segmentation. These works are implemented on GPU and achieve significant speedups, with great scientific impact and relevance to this doctoral thesis. The first work proposes cross-modality based contrast enhancement for tumor segmentation on GPU. It takes target and guidance images as input and enhances the low quality target image by applying a two dimensional histogram approach. The enhanced image is observed to provide more accurate tumor segmentation using GPU based dynamic seeded region growing. The second contribution concerns fast parallel gradient based seeded region growing, where a static approach is proposed and implemented on GPU for accurate vessel segmentation. The third contribution describes GPU acceleration of the Chan-Vese model and cross-modality based contrast enhancement for liver segmentation.
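
    Seeded region growing, which two of the contributions build on, can be sketched serially as below. The sketch assumes a 2D image, 4-connectivity, and an acceptance test against the seed intensity; the thesis's gradient based and dynamic variants are not reproduced here, and `region_grow` is a name chosen for this example.

```python
from collections import deque


def region_grow(image, seed, threshold):
    """Return the set of pixels connected to `seed` with similar intensity."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbors of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if (nr, nc) in region:
                continue
            if abs(image[nr][nc] - seed_value) <= threshold:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region
```

    A parallel version typically replaces the queue with repeated sweeps in which every pixel checks its neighbors at once, trading extra work for independence between pixels.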