    A Multi-signal Variant for the GPU-based Parallelization of Growing Self-Organizing Networks

    Among the many possible approaches for the parallelization of self-organizing networks, and in particular of growing self-organizing networks, perhaps the most common one is producing an optimized, parallel implementation of the standard sequential algorithms reported in the literature. In this paper we explore an alternative approach, based on a new algorithm variant specifically designed to match the features of the large-scale, fine-grained parallelism of GPUs, in which multiple input signals are processed at once. Comparative tests have been performed, using both parallel and sequential implementations of the new algorithm variant, in particular for a growing self-organizing network that reconstructs surfaces from point clouds. The experimental results show that this approach harnesses more effectively the intrinsic parallelism that self-organizing network algorithms seem intuitively to suggest, obtaining better performance even with networks of smaller size. Comment: 17 pages

    Somoclu: An Efficient Parallel Library for Self-Organizing Maps

    Somoclu is a massively parallel tool, written in C++, for training self-organizing maps on large data sets. It builds on OpenMP for multicore execution, and on MPI for distributing the workload across the nodes in a cluster. It is also able to boost training by using CUDA if graphics processing units are available. A sparse kernel is included, which is useful for high-dimensional but sparse data, such as the vector spaces common in text mining workflows. Python, R and MATLAB interfaces facilitate interactive use. Apart from fast execution, memory use is highly optimized, enabling training large emergent maps even on a single computer. Comment: 26 pages, 9 figures. The code is available at https://peterwittek.github.io/somoclu

    FPGA-Based Acceleration of the Self-Organizing Map (SOM) Algorithm using High-Level Synthesis

    One of the fastest growing and most demanding areas of computer science is Machine Learning (ML). The Self-Organizing Map (SOM), categorized as unsupervised ML, is a popular data-mining algorithm widely used in Artificial Neural Networks (ANNs) for mapping high-dimensional data into low-dimensional feature maps. SOM, being computationally intensive, requires high computational time and power when dealing with large datasets. Acceleration of many computationally intensive algorithms can be achieved using Field-Programmable Gate Arrays (FPGAs), but it requires extensive hardware knowledge and longer development time when employing the traditional Hardware Description Language (HDL) based design methodology. Open Computing Language (OpenCL) is a standard framework for writing parallel computing programs that execute on heterogeneous computing systems. The Intel FPGA Software Development Kit for OpenCL (IFSO) is a High-Level Synthesis (HLS) tool that provides a more efficient alternative to HDL-based design. This research presents an optimized OpenCL implementation of the SOM algorithm on Stratix V and Arria 10 FPGAs using IFSO. Compared to recent SOM implementations on Central Processing Units (CPUs) and Graphics Processing Units (GPUs), our OpenCL implementation on FPGAs provides superior speed and power consumption results. Stratix V achieves a speedup of 1.41x - 16.55x compared to AMD and Intel CPUs and 2.18x compared to an Nvidia GPU, whereas Arria 10 achieves a speedup of 1.63x - 19.15x compared to AMD and Intel CPUs and 2.52x compared to an Nvidia GPU. In terms of power consumption, Stratix V is 35.53x and 42.53x, whereas Arria 10 is 15.82x and 15.93x, more power efficient compared to CPU and GPU respectively.

    Enhancing Performance of Parallel Self-Organizing Map on Large Dataset with Dynamic Parallel and Hyper-Q

    The Self-Organizing Map (SOM) is an unsupervised artificial neural network algorithm. Even though this algorithm is known to be an appealing clustering method, many efforts to improve its performance are still pursued in various research works. To reduce computation time, for instance, many previous works have focused on running SOM in parallel. Utilization of the Graphics Processing Unit (GPU) as a parallel calculation engine is also continuously improving. However, total computation time in parallel SOM is still not optimal when processing large datasets. In this research, we propose a combination of Dynamic Parallel and Hyper-Q to further improve the performance of parallel SOM in terms of computing time. Dynamic Parallel and Hyper-Q are both used for calculating distances and searching for the best-matching unit (BMU), while updating the weights of the BMU and its neighbors is performed using Hyper-Q only. The results of this study indicate that parallel SOM runs up to two times faster than the same implementation without Dynamic Parallel and Hyper-Q.
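    The three stages named in this abstract (distance calculation, BMU search, and weight update of the BMU and its neighbors) can be illustrated with a minimal sequential sketch. This is the textbook online SOM step in plain Python, not the authors' GPU implementation; the function names, the Gaussian neighborhood, and the `learning_rate` and `sigma` parameters are assumptions made for illustration.

```python
import math


def best_matching_unit(weights, x):
    """Return the index of the unit whose weight vector is closest to x."""
    best_index, best_dist = 0, float("inf")
    for i, w in enumerate(weights):
        # Squared Euclidean distance; the square root is not needed for argmin.
        d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index


def som_step(weights, coords, x, learning_rate=0.5, sigma=1.0):
    """Apply one online SOM update in place and return the BMU index.

    weights: list of weight vectors (lists of floats), one per unit
    coords:  list of grid positions (tuples), one per unit
    x:       one input sample
    The Gaussian neighborhood is an illustrative assumption.
    """
    bmu = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        # Neighborhood strength decays with grid distance from the BMU.
        grid_dist_sq = sum((a - b) ** 2 for a, b in zip(coords[i], coords[bmu]))
        h = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))
        for k in range(len(w)):
            w[k] += learning_rate * h * (x[k] - w[k])
    return bmu


if __name__ == "__main__":
    weights = [[0.0, 0.0], [1.0, 1.0]]
    coords = [(0, 0), (0, 1)]
    print(som_step(weights, coords, [0.9, 0.9]), weights)
```

    The distance loop and the per-unit update loop are independent across units, which is why these stages are the usual targets for GPU parallelization.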

    3D model reconstruction using neural gas accelerated on GPU

    In this work, we propose the use of the neural gas (NG), a neural network that uses an unsupervised Competitive Hebbian Learning (CHL) rule, to develop a reverse engineering process. This is a simple and accurate method to reconstruct objects from point clouds obtained from multiple overlapping views using low-cost sensors. In contrast to other methods that may need several stages that include downsampling, noise filtering and many other tasks, the NG automatically obtains the 3D model of the scanned objects. To demonstrate the validity of our proposal we tested our method with several models and performed a study of the neural network parameterization, computing the quality of representation and also comparing results with other neural methods like growing neural gas and Kohonen maps or classical methods like Voxel Grid. We also reconstructed models acquired by low-cost sensors that can be used in virtual and augmented reality environments for redesign or manipulation purposes. Since the NG algorithm has a high computational cost, we propose its acceleration. We have redesigned and implemented the NG learning algorithm to fit it onto Graphics Processing Units using CUDA. A speed-up of 180× is obtained compared to the sequential CPU version. This work was partially funded by the Spanish Government DPI2013-40534-R grant.

    Evolution of a double-front Rayleigh-Taylor system using a GPU-based high resolution thermal Lattice-Boltzmann model

    We study the turbulent evolution originated from a system subjected to a Rayleigh-Taylor instability with a double density at high resolution in a two-dimensional geometry using a highly optimized thermal Lattice Boltzmann code for GPUs. The novelty of our investigation stems from the initial condition, given by the superposition of three layers with three different densities, leading to the development of two Rayleigh-Taylor fronts that expand upward and downward and collide in the middle of the cell. By using high-resolution numerical data we highlight the effects induced by the collision of the two turbulent fronts in the long-time asymptotic regime. We also provide details on the optimized Lattice-Boltzmann code that we have run on a cluster of GPUs.

    GPUMLib: Deep Learning SOM Library for Surface Reconstruction

    The evolution of 3D scanning devices and innovation in computer processing power and storage capacity have sparked a revolution in producing big point-cloud datasets. This phenomenon has become an integral part of the sophisticated building design process, especially in the era of the 4th Industrial Revolution. Big point-cloud datasets have complicated surface reconstruction and visualization, since existing algorithms are not so readily available. In this context, intelligent surface reconstruction algorithms need to be revolutionized to deal with big point-cloud datasets in tandem with the advancement of hardware processing power and storage capacity. In this study, we propose GPUMLib, a deep learning library for self-organizing maps (SOM-DLLib), to solve problems involving big point-cloud datasets from 3D scanning devices. The SOM-DLLib consists of multiple layers for reducing and optimizing those big point-cloud datasets. The findings show that the final objects are successfully reconstructed with optimized neighborhood representation, and that performance improves as the size of the point clouds increases.

    XPySom: High-performance self-organizing maps

    In this paper, we introduce XPySom, a new open-source Python implementation of the well-known Self-Organizing Maps (SOM) technique. It is designed to achieve high performance on a single node, exploiting widely available Python libraries for vector processing on multi-core CPUs and GP-GPUs. We present results from an extensive experimental evaluation of XPySom in comparison to widely used open-source SOM implementations, showing that it outperforms the other available alternatives. Indeed, our experimentation carried out using the Extended MNIST open data set shows a speed-up of about 7x and 100x when compared to the best open-source multi-core implementations we could find with multi-core and GP-GPU acceleration, respectively, achieving the same accuracy levels in terms of quantization error
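    The quantization error mentioned in this abstract is commonly defined as the mean distance between each sample and the weight vector of its best-matching unit. The following is a minimal plain-Python sketch of that common definition; it does not use XPySom's API, and the function name and argument layout are assumptions made for illustration.

```python
import math


def quantization_error(weights, samples):
    """Mean Euclidean distance from each sample to its closest weight vector.

    weights: list of unit weight vectors
    samples: list of input vectors with the same dimensionality
    """
    if not samples:
        raise ValueError("samples must not be empty")
    total = 0.0
    for x in samples:
        # Distance to the best-matching unit for this sample.
        total += min(math.dist(w, x) for w in weights)
    return total / len(samples)


if __name__ == "__main__":
    print(quantization_error([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [3.0, 0.0]]))
```

    A lower value means the map's units sit closer to the data, so two implementations reaching the same value have fitted the data equally well by this measure.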

    Distributed learning of CNNs on heterogeneous CPU/GPU architectures

    Convolutional Neural Networks (CNNs) have been shown to be powerful classification tools in tasks that range from check reading to medical diagnosis, reaching close to human perception, and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates to larger CNNs, leading to longer training times that not even the adoption of Graphics Processing Units (GPUs) could keep up with. This problem is partially solved by using more processing units and distributed training methods that are offered by several frameworks dedicated to neural network training. However, these techniques do not take full advantage of the possible parallelization offered by CNNs and the cooperative use of heterogeneous devices with different processing capabilities, clock speeds, and memory sizes, among others. This paper presents a new method for the parallel training of CNNs that can be considered as a particular instantiation of model parallelism, where only the convolutional layer is distributed. In fact, the convolutions processed during training (forward and backward propagation included) represent from 60-90% of global processing time. The paper analyzes the influence of network size, bandwidth, batch size, number of devices, including their processing capabilities, and other parameters. Results show that this technique is capable of diminishing the training time without affecting the classification performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers, and 500 and 1500 kernels, respectively, the best speedups reach 3.28× using four CPUs and 2.45× with three GPUs. Modern imaging datasets, larger and more complex than CIFAR-10, will certainly require more than 60-90% of processing time calculating convolutions, and speedups will tend to increase accordingly.