A Multi-signal Variant for the GPU-based Parallelization of Growing Self-Organizing Networks
Among the many possible approaches for the parallelization of self-organizing
networks, and in particular of growing self-organizing networks, perhaps the
most common one is producing an optimized, parallel implementation of the
standard sequential algorithms reported in the literature. In this paper we
explore an alternative approach, based on a new algorithm variant specifically
designed to match the features of the large-scale, fine-grained parallelism of
GPUs, in which multiple input signals are processed at once. Comparative tests
have been performed, using both parallel and sequential implementations of the
new algorithm variant, in particular for a growing self-organizing network that
reconstructs surfaces from point clouds. The experimental results show that
this approach allows harnessing in a more effective way the intrinsic
parallelism that the self-organizing networks algorithms seem intuitively to
suggest, obtaining better performance even with networks of smaller size. Comment: 17 page
Somoclu: An Efficient Parallel Library for Self-Organizing Maps
Somoclu is a massively parallel tool for training self-organizing maps on
large data sets written in C++. It builds on OpenMP for multicore execution,
and on MPI for distributing the workload across the nodes in a cluster. It is
also able to boost training by using CUDA if graphics processing units are
available. A sparse kernel is included, which is useful for high-dimensional
but sparse data, such as the vector spaces common in text mining workflows.
Python, R and MATLAB interfaces facilitate interactive use. Apart from fast
execution, memory use is highly optimized, enabling training large emergent
maps even on a single computer. Comment: 26 pages, 9 figures. The code is available at
https://peterwittek.github.io/somoclu
FPGA-Based Acceleration of the Self-Organizing Map (SOM) Algorithm using High-Level Synthesis
One of the fastest growing and the most demanding areas of computer science is Machine Learning (ML). Self-Organizing Map (SOM), categorized as unsupervised ML, is a popular data-mining algorithm widely used in Artificial Neural Network (ANN) for mapping high dimensional data into low dimensional feature maps. SOM, being computationally intensive, requires high computational time and power when dealing with large datasets. Acceleration of many computationally intensive algorithms can be achieved using Field-Programmable Gate Arrays (FPGAs) but it requires extensive hardware knowledge and longer development time when employing traditional Hardware Description Language (HDL) based design methodology. Open Computing Language (OpenCL) is a standard framework for writing parallel computing programs that execute on heterogeneous computing systems. Intel FPGA Software Development Kit for OpenCL (IFSO) is a High-Level Synthesis (HLS) tool that provides a more efficient alternative to HDL-based design. This research presents an optimized OpenCL implementation of SOM algorithm on Stratix V and Arria 10 FPGAs using IFSO. Compared to recent SOM implementations on Central Processing Unit (CPU) and Graphics Processing Unit (GPU), our OpenCL implementation on FPGAs provides superior speed performance and power consumption results. Stratix V achieves speedup of 1.41x - 16.55x compared to AMD and Intel CPU and 2.18x compared to Nvidia GPU whereas Arria 10 achieves speedup of 1.63x - 19.15x compared to AMD and Intel CPU and 2.52x compared to Nvidia GPU. In terms of power consumption, Stratix V is 35.53x and 42.53x whereas Arria 10 is 15.82x and 15.93x more power efficient compared to CPU and GPU respectively
Enhancing Performance of Parallel Self-Organizing Map on Large Dataset with Dynamic Parallel and Hyper-Q
Self-Organizing Map (SOM) is an unsupervised artificial neural network algorithm. Even though this algorithm is known to be an appealing clustering method, many research efforts still aim to improve its performance. To reduce computation time, for instance, many previous works have focused on running SOM in parallel. Utilization of the Graphics Processing Unit (GPU) as a parallel calculation engine is also continuously improving. However, the total computation time of parallel SOM is still not optimal when processing large datasets. In this research, we propose a combination of Dynamic Parallel and Hyper-Q to further reduce the computing time of parallel SOM. Dynamic Parallel and Hyper-Q are used together for calculating distances and searching for the best-matching unit (BMU), while updating the weights of the winning unit and its neighbors uses Hyper-Q only. The results indicate that parallel SOM runs up to two times faster than without Dynamic Parallel and Hyper-Q
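The two phases named in this abstract (distance calculation with BMU search, then the weight update of the winner and its neighbors) can be illustrated with a minimal sequential sketch. It is plain Python, not the authors' GPU code, and it does not use Dynamic Parallel or Hyper-Q. The function names, the Gaussian neighborhood, and the parameters `lr` and `sigma` are assumptions chosen for illustration.

```python
import math


def best_matching_unit(weights, x):
    """Return the index of the unit whose weight vector is closest to x."""
    dists = [sum((w_i - x_i) ** 2 for w_i, x_i in zip(w, x)) for w in weights]
    return min(range(len(weights)), key=dists.__getitem__)


def som_step(weights, coords, x, lr=0.5, sigma=1.0):
    """One online SOM update (hypothetical helper, assumed Gaussian neighborhood)."""
    bmu = best_matching_unit(weights, x)
    for j, w in enumerate(weights):
        # Squared distance on the map grid between unit j and the BMU.
        grid_d2 = sum((a - b) ** 2 for a, b in zip(coords[j], coords[bmu]))
        # Neighborhood strength: 1 at the BMU, decaying with grid distance.
        h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Pull the unit toward the input, scaled by learning rate and h.
        weights[j] = [w_i + lr * h * (x_i - w_i) for w_i, x_i in zip(w, x)]
    return bmu


# A 2x2 map with two-dimensional weights (toy values, not from the paper).
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
bmu = som_step(weights, coords, [0.9, 0.8])
```

Both loops are independent across units, which is why the distance calculation and the update are natural candidates for parallel execution.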
3D model reconstruction using neural gas accelerated on GPU
In this work, we propose the use of the neural gas (NG), a neural network that uses an unsupervised Competitive Hebbian Learning (CHL) rule, to develop a reverse engineering process. This is a simple and accurate method to reconstruct objects from point clouds obtained from multiple overlapping views using low-cost sensors. In contrast to other methods that may need several stages that include downsampling, noise filtering and many other tasks, the NG automatically obtains the 3D model of the scanned objects. To demonstrate the validity of our proposal we tested our method with several models and performed a study of the neural network parameterization computing the quality of representation and also comparing results with other neural methods like growing neural gas and Kohonen maps or classical methods like Voxel Grid. We also reconstructed models acquired by low cost sensors that can be used in virtual and augmented reality environments for redesign or manipulation purposes. Since the NG algorithm has a strong computational cost we propose its acceleration. We have redesigned and implemented the NG learning algorithm to fit it onto Graphics Processing Units using CUDA. A speed-up of 180× is obtained compared to the sequential CPU version. This work was partially funded by the Spanish Government DPI2013-40534-R grant
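For readers unfamiliar with neural gas, the rank-based adaptation step of the standard algorithm can be sketched as follows. This is a sequential toy version in plain Python, not the authors' CUDA implementation. It omits the CHL edge creation mentioned in the abstract, and the function name and the parameters `eps` and `lam` are assumptions for illustration.

```python
import math


def neural_gas_step(weights, x, eps=0.5, lam=1.0):
    """One neural gas adaptation step (hypothetical helper).

    Units are ranked by distance to the input x; each unit moves toward x
    by a factor eps * exp(-rank / lam), so closer units move more.
    """
    def sq_dist(w):
        return sum((w_i - x_i) ** 2 for w_i, x_i in zip(w, x))

    # Rank 0 is the closest unit, rank 1 the second closest, and so on.
    order = sorted(range(len(weights)), key=lambda j: sq_dist(weights[j]))
    for rank, j in enumerate(order):
        step = eps * math.exp(-rank / lam)
        weights[j] = [w_i + step * (x_i - w_i) for w_i, x_i in zip(weights[j], x)]
    return order


# Three units in 2D and one input point (toy values, not from the paper).
units = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
order = neural_gas_step(units, [1.0, 2.0])
```

The sort over all units for every input is the costly part of the sequential algorithm, which is consistent with the abstract's motivation for GPU acceleration.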
Evolution of a double-front Rayleigh-Taylor system using a GPU-based high resolution thermal Lattice-Boltzmann model
We study the turbulent evolution originated from a system subjected to a
Rayleigh-Taylor instability with a double density at high resolution in a 2
dimensional geometry using a highly optimized thermal Lattice Boltzmann code
for GPUs. The novelty of our investigation stems from the initial condition,
given by the superposition of three layers with three different densities,
leading to the development of two Rayleigh-Taylor fronts that expand upward and
downward and collide in the middle of the cell. By using high resolution
numerical data we highlight the effects induced by the collision of the two
turbulent fronts in the long time asymptotic regime. We also provide details on
the optimized Lattice-Boltzmann code that we have run on a cluster of GPU
GPUMLib: Deep Learning SOM Library for Surface Reconstruction
The evolution of 3D scanning devices and innovation in computer
processing power and storage capacity has sparked the revolution of
producing big point-cloud datasets. This phenomenon has become
an integral part of the sophisticated building design process
especially in the era of the 4th Industrial Revolution. The big point-cloud
datasets have caused complexity in handling surface reconstruction
and visualization since existing algorithms are not so readily
available. In this context, the surface reconstruction intelligent
algorithms need to be revolutionized to deal with big point-cloud
datasets in tandem with the advancement of hardware processing
power and storage capacity. In this study, we propose GPUMLib –
deep learning library for self-organizing map (SOM-DLLib) to solve
problems involving big point-cloud datasets from 3D scanning
devices. The SOM-DLLib consists of multiple layers for reducing
and optimizing those big point cloud datasets. The findings show the
final objects are successfully reconstructed with optimized
neighborhood representation and the performance becomes better as
the size of point clouds increases
XPySom: High-performance self-organizing maps
In this paper, we introduce XPySom, a new open-source Python implementation of the well-known Self-Organizing Maps (SOM) technique. It is designed to achieve high performance on a single node, exploiting widely available Python libraries for vector processing on multi-core CPUs and GP-GPUs. We present results from an extensive experimental evaluation of XPySom in comparison to widely used open-source SOM implementations, showing that it outperforms the other available alternatives. Indeed, our experimentation carried out using the Extended MNIST open data set shows a speed-up of about 7x and 100x when compared to the best open-source multi-core implementations we could find with multi-core and GP-GPU acceleration, respectively, achieving the same accuracy levels in terms of quantization error
Distributed learning of CNNs on heterogeneous CPU/GPU architectures
Convolutional Neural Networks (CNNs) have been shown to be powerful classification
tools in tasks that range from check reading to medical diagnosis, reaching
close to human perception, and in some cases surpassing it. However, the
problems to solve are becoming larger and more complex, which translates to
larger CNNs, leading to longer training times that not even the adoption of
Graphics Processing Units (GPUs) could keep up with. This problem is partially
solved by using more processing units and distributed training methods that are
offered by several frameworks dedicated to neural network training. However,
these techniques do not take full advantage of the possible parallelization
offered by CNNs and the cooperative use of heterogeneous devices with different
processing capabilities, clock speeds, memory size, among others. This paper
presents a new method for the parallel training of CNNs that can be considered
as a particular instantiation of model parallelism, where only the
convolutional layer is distributed. In fact, the convolutions processed during
training (forward and backward propagation included) represent from -\%
of global processing time. The paper analyzes the influence of network size,
bandwidth, batch size, number of devices, including their processing
capabilities, and other parameters. Results show that this technique is capable
of diminishing the training time without affecting the classification
performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with
two convolutional layers, and and kernels, respectively, best
speedups achieve using four CPUs and with three GPUs.
Modern imaging datasets, larger and more complex than CIFAR-10 will certainly
require more than -\% of processing time calculating convolutions, and
speedups will tend to increase accordingly