17 research outputs found

    Using CUDA Shared Memory in a Parallel Implementation of a Feedforward Artificial Neural Network

    Get PDF
    The paper considers how the way shared memory is used affects the performance of an artificial neural network implementation on the CUDA platform. Variants that place several windows of input data and the neurons' weight coefficients in shared memory are examined. It is shown that, because the time spent waiting for data to be loaded from global memory is not used productively, the performance of these variants does not exceed that of the basic parallelization scheme.
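
    The abstract does not spell out either scheme, so the following is only a minimal CUDA sketch of the shared-memory idea it investigates: one block stages one window of input data together with the weights of the neurons it computes in shared memory before forming the layer outputs. All sizes, identifiers and the activation function are assumptions, not the paper's code.

    // Minimal sketch: one block processes one input window; the window and the
    // block's neuron weights are staged in shared memory before the dot products.
    #include <cuda_runtime.h>
    #include <cstdio>

    #define N_IN   64   // inputs per window (assumed)
    #define N_OUT  64   // neurons in the layer == threads per block (assumed)

    __global__ void ffLayerShared(const float* windows,  // [numWindows][N_IN]
                                  const float* weights,  // [N_OUT][N_IN]
                                  float* outputs)        // [numWindows][N_OUT]
    {
        __shared__ float sWindow[N_IN];
        __shared__ float sWeights[N_OUT][N_IN];

        const int window = blockIdx.x;   // one block per input window
        const int neuron = threadIdx.x;  // one thread per neuron

        // Cooperative load of the input window into shared memory.
        for (int i = threadIdx.x; i < N_IN; i += blockDim.x)
            sWindow[i] = windows[window * N_IN + i];

        // Each thread stages the weight row of its own neuron.
        for (int i = 0; i < N_IN; ++i)
            sWeights[neuron][i] = weights[neuron * N_IN + i];

        __syncthreads();  // all loads must finish before the dot products start

        float acc = 0.0f;
        for (int i = 0; i < N_IN; ++i)
            acc += sWeights[neuron][i] * sWindow[i];

        outputs[window * N_OUT + neuron] = tanhf(acc);  // activation (assumed tanh)
    }

    int main()
    {
        const int numWindows = 256;
        float *dWin, *dWts, *dOut;
        cudaMalloc((void**)&dWin, numWindows * N_IN  * sizeof(float));
        cudaMalloc((void**)&dWts, N_OUT      * N_IN  * sizeof(float));
        cudaMalloc((void**)&dOut, numWindows * N_OUT * sizeof(float));
        cudaMemset(dWin, 0, numWindows * N_IN * sizeof(float));
        cudaMemset(dWts, 0, N_OUT * N_IN * sizeof(float));

        ffLayerShared<<<numWindows, N_OUT>>>(dWin, dWts, dOut);
        cudaDeviceSynchronize();
        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(dWin); cudaFree(dWts); cudaFree(dOut);
        return 0;
    }

    The staging itself does not remove the global-memory latency; as the abstract notes, whether it pays off depends on whether other useful work can be done while the loads are in flight.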

    Central Processing Unit-Graphics Processing Unit Computing Scheme for Multi-Object Tracking in Surveillance

    Get PDF
    This research work presents a novel central processing unit-graphics processing unit (CPU-GPU) computing scheme for multiple object tracking during a surveillance operation. The scheme offloads the nonlinear computational jobs so that the tracking function completes in minimal processing time. The work has two essential objectives: first, to dynamically divide the processing operations into parallel units, and second, to reduce the communication between the CPU and GPU processing units.
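
    The abstract does not describe the scheme's internals, so the following is only a generic CUDA sketch of one standard way to reduce the visible CPU-GPU communication cost: double-buffered streams that overlap copying the next frame to the GPU with processing of the current one, leaving the CPU free for its share of the pipeline. Frame size, kernel contents and all identifiers are assumptions, not the paper's code.

    // Generic sketch: two streams and two device buffers so that the host-to-device
    // copy of frame f+1 can overlap with the kernel working on frame f.
    #include <cuda_runtime.h>
    #include <cstdio>

    #define FRAME_PIXELS (640 * 480)   // assumed frame size
    #define NUM_FRAMES   8

    __global__ void processFrame(const unsigned char* frame, float* score, int n)
    {
        // Placeholder per-pixel work standing in for the GPU part of the tracker.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) score[i] = frame[i] * 0.5f;
    }

    int main()
    {
        unsigned char* hFrames;  // pinned host memory enables asynchronous copies
        cudaHostAlloc((void**)&hFrames, NUM_FRAMES * FRAME_PIXELS, cudaHostAllocDefault);

        unsigned char* dFrame[2];
        float*         dScore[2];
        cudaStream_t   stream[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc((void**)&dFrame[b], FRAME_PIXELS);
            cudaMalloc((void**)&dScore[b], FRAME_PIXELS * sizeof(float));
            cudaStreamCreate(&stream[b]);
        }

        for (int f = 0; f < NUM_FRAMES; ++f) {
            int b = f & 1;  // alternate buffers and streams
            cudaMemcpyAsync(dFrame[b], hFrames + f * FRAME_PIXELS, FRAME_PIXELS,
                            cudaMemcpyHostToDevice, stream[b]);
            processFrame<<<(FRAME_PIXELS + 255) / 256, 256, 0, stream[b]>>>(
                dFrame[b], dScore[b], FRAME_PIXELS);
            // The CPU can run its own portion of the tracking pipeline here
            // (e.g. data association) while both streams stay busy.
        }
        cudaDeviceSynchronize();
        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

        for (int b = 0; b < 2; ++b) {
            cudaFree(dFrame[b]); cudaFree(dScore[b]); cudaStreamDestroy(stream[b]);
        }
        cudaFreeHost(hFrames);
        return 0;
    }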

    Acceleration of stereo-matching on multi-core CPU and GPU

    Get PDF
    This paper presents an accelerated version of a dense stereo-correspondence algorithm for two different parallelism-enabled architectures, a multi-core CPU and a GPU. The algorithm is part of the vision system developed for a binocular robot head in the context of the CloPeMa 1 research project, which focuses on the design of a new clothes-folding robot with real-time and high-resolution requirements for its vision system. The performance analysis shows that the parallelised stereo-matching algorithm has been significantly accelerated, achieving 12x and 176x speed-ups for the multi-core CPU and the GPU respectively, compared with a non-SIMD single-thread CPU implementation. To analyse the origin of the speed-up and gain a deeper understanding of the choice of optimal hardware, the algorithm was broken into key sub-tasks and its performance was tested on four different hardware architectures.
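
    For orientation only, here is a minimal CUDA sketch of the kind of dense per-pixel work that stereo matching parallelises well: a brute-force SAD block-matching kernel with one thread per pixel. It is not the CloPeMa algorithm; image size, window radius and disparity range are assumptions.

    // One thread per pixel: scan the disparity range and keep the disparity with
    // the lowest sum-of-absolute-differences (SAD) cost over a square window.
    #include <cuda_runtime.h>
    #include <climits>
    #include <cstdio>

    #define W        640
    #define H        480
    #define MAX_DISP 64
    #define RADIUS   3          // (2*RADIUS+1)^2 matching window

    __global__ void sadStereo(const unsigned char* left, const unsigned char* right,
                              unsigned char* disparity)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= W || y >= H) return;

        int bestDisp = 0;
        int bestCost = INT_MAX;
        for (int d = 0; d < MAX_DISP; ++d) {
            int cost = 0;
            for (int dy = -RADIUS; dy <= RADIUS; ++dy)
                for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
                    int xl = min(max(x + dx, 0), W - 1);      // clamp to the image
                    int yy = min(max(y + dy, 0), H - 1);
                    int xr = min(max(x + dx - d, 0), W - 1);  // shifted right-image column
                    cost += abs(left[yy * W + xl] - right[yy * W + xr]);
                }
            if (cost < bestCost) { bestCost = cost; bestDisp = d; }
        }
        disparity[y * W + x] = (unsigned char)bestDisp;
    }

    int main()
    {
        unsigned char *dL, *dR, *dD;
        cudaMalloc((void**)&dL, W * H);
        cudaMalloc((void**)&dR, W * H);
        cudaMalloc((void**)&dD, W * H);
        cudaMemset(dL, 0, W * H);
        cudaMemset(dR, 0, W * H);

        dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
        sadStereo<<<grid, block>>>(dL, dR, dD);
        cudaDeviceSynchronize();
        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(dL); cudaFree(dR); cudaFree(dD);
        return 0;
    }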

    Graphics processor unit hardware acceleration of Levenberg-Marquardt artificial neural network training

    Get PDF
    This paper makes two principal contributions. The first is that there appears to be no previous description in the research literature of an artificial neural network implementation on a graphics processor unit (GPU) that uses the Levenberg-Marquardt (LM) training method. The second is an initial attempt at determining when it is computationally beneficial to exploit a GPU's parallel nature in preference to the traditional implementation on a central processing unit (CPU). The paper describes the approach taken to successfully implement the LM method, discusses the advantages of this approach for GPU implementation and presents results that compare GPU and CPU performance on two test data sets.
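
    The heart of an LM step is forming and solving the normal equations (J^T J + mu*I) dw = J^T e, where J is the Jacobian of the network errors with respect to the weights, e is the residual vector and mu is the damping factor; building J^T J and J^T e is where most of the arithmetic lives, and it parallelises naturally. The sketch below illustrates only that idea and is not the paper's implementation; problem sizes, the damping value and all identifiers are assumptions, and the small dense solve for dw is left to the CPU or a library.

    // One thread per entry of A = J^T J + mu*I; the first column of threads also
    // accumulates g = J^T e. J is stored row-major, one row per training sample.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void normalEquations(const float* J, const float* e,
                                    float* A, float* g,
                                    int nSamples, int nWeights, float mu)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nWeights || col >= nWeights) return;

        float acc = 0.0f;
        for (int s = 0; s < nSamples; ++s)
            acc += J[s * nWeights + row] * J[s * nWeights + col];
        if (row == col) acc += mu;              // Levenberg-Marquardt damping term
        A[row * nWeights + col] = acc;

        if (col == 0) {                         // one column of threads fills g
            float gv = 0.0f;
            for (int s = 0; s < nSamples; ++s)
                gv += J[s * nWeights + row] * e[s];
            g[row] = gv;
        }
    }

    int main()
    {
        const int nSamples = 1024, nWeights = 128;   // assumed problem size
        float *dJ, *de, *dA, *dg;
        cudaMalloc((void**)&dJ, nSamples * nWeights * sizeof(float));
        cudaMalloc((void**)&de, nSamples * sizeof(float));
        cudaMalloc((void**)&dA, nWeights * nWeights * sizeof(float));
        cudaMalloc((void**)&dg, nWeights * sizeof(float));
        cudaMemset(dJ, 0, nSamples * nWeights * sizeof(float));
        cudaMemset(de, 0, nSamples * sizeof(float));

        dim3 block(16, 16), grid((nWeights + 15) / 16, (nWeights + 15) / 16);
        normalEquations<<<grid, block>>>(dJ, de, dA, dg, nSamples, nWeights, 0.01f);
        cudaDeviceSynchronize();
        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(dJ); cudaFree(de); cudaFree(dA); cudaFree(dg);
        return 0;
    }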

    Parallel Processing of a Data Stream by Artificial Neural Networks on the CUDA Platform

    Get PDF
    A scheme for the software implementation of a feedforward (backpropagation) artificial neural network (ANN) on the massively parallel computing platform CUDA is proposed.

    Portable GPU-Based Artificial Neural Networks For Data-Driven Modeling

    Full text link
    Artificial neural networks (ANNs) are widely applied as data-driven modeling tools in hydroinformatics because they can handle implicit and nonlinear relationships between input and output data. To obtain a reliable ANN model, training on the data is essential, but training usually takes many hours for large data sets and/or large systems with many variants. This may not be a concern when an ANN is trained for offline applications, but it is of great importance when an ANN is trained or retrained for real-time and near-real-time applications, which are becoming an increasingly active research theme as hydroinformatics tools become an integral part of smart-city operation systems. The author's previous research projects showed that a GPU-based ANN is more than 10x more efficient than a CPU-based ANN for constructing a meta-model (fast simulation) applied as a surrogate for a physics-based model (slow simulation). This paper presents the latest development of GPU-based ANN computing kernels implemented with OpenCL (Open Computing Language). The generalized ANN can be used as an efficient machine-learning library for data-driven modeling. The performance of the implemented library has been tested with a benchmark example and compared with the previous results.
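
    The paper's kernels are written in OpenCL; purely as an illustration of how batch surrogate evaluation maps onto a GPU, here is a minimal CUDA analogue that computes the hidden-layer activations of a one-hidden-layer ANN for many input samples at once, one thread per (sample, neuron) pair. Layer sizes, the sigmoid activation and all identifiers are assumptions, not the library's API.

    // Batch forward pass of the hidden layer: thread (sample, neuron) computes one
    // activation, so thousands of surrogate evaluations run in a single launch.
    #include <cuda_runtime.h>
    #include <cstdio>

    #define N_IN     8      // model inputs (assumed)
    #define N_HIDDEN 32     // hidden neurons (assumed)

    __global__ void surrogateForward(const float* x,  // [nSamples][N_IN]
                                     const float* w,  // [N_HIDDEN][N_IN]
                                     const float* b,  // [N_HIDDEN]
                                     float* h,        // [nSamples][N_HIDDEN]
                                     int nSamples)
    {
        int sample = blockIdx.x * blockDim.x + threadIdx.x;
        int neuron = blockIdx.y;
        if (sample >= nSamples || neuron >= N_HIDDEN) return;

        float acc = b[neuron];
        for (int i = 0; i < N_IN; ++i)
            acc += w[neuron * N_IN + i] * x[sample * N_IN + i];
        h[sample * N_HIDDEN + neuron] = 1.0f / (1.0f + expf(-acc));  // sigmoid
    }

    int main()
    {
        const int nSamples = 10000;   // many scenarios evaluated in one batch
        float *dx, *dw, *db, *dh;
        cudaMalloc((void**)&dx, nSamples * N_IN * sizeof(float));
        cudaMalloc((void**)&dw, N_HIDDEN * N_IN * sizeof(float));
        cudaMalloc((void**)&db, N_HIDDEN * sizeof(float));
        cudaMalloc((void**)&dh, nSamples * N_HIDDEN * sizeof(float));
        cudaMemset(dx, 0, nSamples * N_IN * sizeof(float));
        cudaMemset(dw, 0, N_HIDDEN * N_IN * sizeof(float));
        cudaMemset(db, 0, N_HIDDEN * sizeof(float));

        dim3 grid((nSamples + 255) / 256, N_HIDDEN);
        surrogateForward<<<grid, 256>>>(dx, dw, db, dh, nSamples);
        cudaDeviceSynchronize();
        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(dx); cudaFree(dw); cudaFree(db); cudaFree(dh);
        return 0;
    }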

    Study of Camera Spectral Reflectance Reconstruction Performance using CPU and GPU Artificial Neural Network Modelling

    Get PDF
    Reconstruction of reflectance spectra from camera RGB values is possible if the characteristics of the illumination source, optics and sensors are known. If they are not, additional information about them has to be acquired in some other way. If, alongside the pictures taken, RGB values of some colour patches with known reflectance spectra are obtained under the same illumination conditions, reflectance-reconstruction models can be created based on artificial neural networks (ANNs). In Matlab, multilayer feedforward networks can be trained using different algorithms. In our study we hypothesized that the scaled conjugate gradient backpropagation (BP) algorithm, when executed on a graphics processing unit, is very fast but, in terms of convergence and performance, does not match the Levenberg-Marquardt (LM) algorithm, which, on the other hand, executes only on the CPU and is therefore much more time-consuming. We also presumed that a correlation between the two algorithms exists, manifested through the dependency of MSE on the number of hidden-layer neurons, so the faster BP algorithm could be used to narrow the search span for the LM algorithm when looking for the best ANN for reflectance reconstruction. The conducted experiment confirmed the speed superiority of the BP algorithm but also confirmed the better convergence and accuracy of reflectance reconstruction with the LM algorithm. The correlation of reflectance-recovery results between ANNs modelled by the two training algorithms was confirmed, and a strong correlation was found between the 3rd-order polynomial approximations of the LM and BP algorithms' test performances for both mean and best performance.
