Self-organising neural models have the ability to provide a good representation of the input space. In particular the Growing Neural Gas (GNG) is a suitable model because of its flexibility, rapid adaptation and excellent quality of representation.
Works on computer vision and man-machine interaction, such as image compression [4], segmentation and representation of objects [5, 6, 7], object tracking [8, 9, 10], gesture recognition [11, 12, 13], and 3D reconstruction [14, 15, 16], have been developed in recent years.
However, in many cases these applications have temporal constraints, which makes it necessary to look for mechanisms that accelerate the learning process.
In order to accelerate the neural network learning algorithm, the sequential algorithm executed on the CPU has been redesigned to exploit the parallelism offered by the GPU. Current GPUs have a large number of processors that can be used for general purpose computing. The GPU is specifically appropriate for computationally intensive problems that can be expressed as data parallel computations [17, 18]. However, implementation on a GPU requires redesigning the algorithms so that they are adapted to its architecture. In addition, programming these devices involves several restrictions, such as the need for high occupancy in each processor in order to hide the latencies produced by memory access, the management and synchronization of the different threads running simultaneously, and the proper use of the memory hierarchy, among other considerations. Researchers have already successfully applied GPU computing to problems that were traditionally addressed by the CPU [19, 20, 17].
The GPU implementation used in this work is based on NVIDIA's CUDA architecture [21] , which is supported by most current NVIDIA graphics chips.
Supercomputers that currently lead the world ranking combine the use of a large number of CPUs with a high number of GPUs.
Neural networks have been used successfully in previous works to reduce the dimensionality of 3D input data while maintaining good topology preservation [15], [22], [23] and [16].
In particular, 3D scene reconstruction is a time-consuming task that is fundamental in most mobile robotic systems [24, 25, 26, 27, 28, 29, 30]. However, most of these works do not deal with real-time restrictions.
To validate our work we have applied our accelerated GNG implementation to the extraction and modelling of features from 3D raw data [31, 32, 33, 34]. Moreover, using this method, apart from accelerating the routine, we achieve two other advantages: a complexity reduction (when compared with raw data) and an improvement of speed-up without decreasing the quality of the representation obtained.
The rest of the paper is organized as follows: Section 2 describes basic concepts of GPGPU architecture and CUDA software. Section 3 provides a description of the topology learning algorithm of the GNG, and how the algorithm is fitted onto GPGPU architecture. Section 4 presents some experiments and results of the parallel implementation running onto a GPU compared with the single-threaded and multi-threaded CPU versions. Finally, section 5 presents a real application with time constraints to validate our implementation, followed by our main conclusions and future work.
GPGPU architecture
A CUDA compatible GPU is organized as a set of multiprocessors, as shown in figure 1 [35]. These multiprocessors, called Streaming Multiprocessors (SMs), are highly parallel at thread level, and their number varies depending on the generation of the GPU. Each SM consists of a series of Streaming Processors (SPs) that share the control logic and cache memory. Each of these SPs can be launched in parallel with a huge number of threads. For instance, the GT400 chip family supports up to 1024 threads per SM, with 480 SPs distributed between 15 SMs. The GT400 chip delivers a computing power of 1.5 teraflops, launching a total of 15,360 threads simultaneously. Current GPUs have up to 12 GBytes of DRAM, referenced in figure 1 as global memory.
The global memory is used and shared by all the multiprocessors, but it has a high latency. These threads are executed simultaneously working onto large data in parallel.
Each of them runs a copy of the kernel (a piece of code that is executed on the GPU) and uses local indexes to be identified.
Threads are grouped into blocks to be executed. Each of these blocks is allocated on a single multiprocessor, enabling the execution of several blocks within a multiprocessor. The number of blocks that are executed depends on the resources available on the multiprocessor, scheduled by a system of priority queues. Within each of these blocks, the threads are grouped into sets of 32 units in order to carry out a fully parallel execution on the processors. Each set of 32 threads is called a warp. The architecture imposes certain restrictions on the maximum number of blocks, warps and threads on each multiprocessor, which vary depending on the generation and model of the graphics card. In addition, these parameters are set for each execution of a kernel to get the maximum occupancy of hardware resources and obtain the best performance. The experiments section shows how to fit these parameters to execute our GPU implementation.
The CUDA architecture also has a memory hierarchy. Different types of memory can be found: constant, texture, global, shared and local memory, as well as registers. The shared memory is useful to implement caches. Texture and constant memory are used to reduce the computational cost by avoiding accesses to global memory, which has high latency.
In the last years, a large number of applications have used GPUs to speed up the processing of neural networks algorithms [36, 37, 38, 39, 40, 41] applied to various computer vision problems such as the representation and tracking of objects in scenes [42] , face representation and tracking [43] or pose estimation [44] .
GNG implementation using GPUs
Starting from the Neural Gas model [45] and Growing Cell Structures [46], Fritzke developed the Growing Neural Gas model [3], which has no predefined topology of union between neurons and in which new neurons are added to an initial set (figure 2).
[Figure 2: flowchart of the GNG learning algorithm, showing the reconfiguration module, the insertion/deletion module, the loop repeated λ times, and the loop repeated until the ending condition is fulfilled.]
The GNG learning algorithm has a high computational cost. We propose a method to accelerate it using GPUs, taking advantage of the many-core architecture provided by these devices as well as their parallelism at the instruction level.
GPUs are specialized hardware for computationally intensive, highly parallel workloads: unlike CPUs, they devote more transistors to data processing and fewer to flow control or cache management. We have used the architecture and the programming tools (language, compiler, development environment, debugger, libraries, etc.) provided by NVIDIA to exploit their hardware parallelism.
GNG Algorithm
GNG is an unsupervised incremental clustering algorithm that given some in- [...]
The graph generated by competitive Hebbian learning (CHL) creates an "induced Delaunay triangulation" that is a sub-graph of the Delaunay triangulation corresponding to the set of nodes. The induced Delaunay triangulation optimally preserves the topology in a very general sense [47]. CHL is an essential component of the GNG algorithm since it is used to process the local adaptation of nodes and the insertion of new ones.
The network is specified as:
• A set N of nodes (neurons). Each neuron $c \in N$ has an associated reference vector $w_c \in \mathbb{R}^n$. The reference vectors can be regarded as the positions of their corresponding neurons in the input space.
• A set of edges (connections) between pairs of neurons. These connections are not weighted, and their purpose is to define the topological structure. An edge aging scheme is used to remove connections that become invalid due to the motion of the neurons during the adaptation process.
GNG uses parameters that are constant in time. Furthermore, it is not necessary to decide a priori the number of nodes to use, since nodes are added incrementally during execution. Insertion of new nodes stops when a user-defined performance criterion is fulfilled or when a maximum network size has been reached.
The adaptation of the network to the input space vectors is produced in step 6. The insertion of connections (step 4) between the winning neuron and the second closest to the input signal provides the topological relationship between the neurons (figure 2).
The elimination of connections (step 8) removes the edges that are no longer part of that topology. This is done by removing the connections between neurons that are no longer near or that have other neurons located closer, so that the age of these connections exceeds a threshold.
The accumulation of the error (step 5) can identify those areas of the input space of vectors where it is necessary to increase the number of neurons to improve the mapping.
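The steps referenced above can be sketched in a few lines. The following is a minimal single-iteration sketch that assumes the standard GNG formulation by Fritzke [3]. The function name `adapt`, the dictionary-based data layout and the default parameter values are illustrative assumptions, not the implementation used in this work.

```python
import math

def adapt(weights, edges, errors, x, eps_w=0.1, eps_n=0.001, a_max=250):
    """One GNG adaptation iteration for input pattern x (hypothetical layout).

    weights: dict neuron id -> list of floats (reference vector)
    edges:   dict frozenset({a, b}) -> age
    errors:  dict neuron id -> accumulated error
    """
    # Find the winner s1 and the second closest neuron s2.
    dist = {c: math.dist(w, x) for c, w in weights.items()}
    s1, s2 = sorted(dist, key=dist.get)[:2]

    # Age the edges emanating from the winner.
    for e in edges:
        if s1 in e:
            edges[e] += 1

    # Insertion of connections (step 4): create or refresh the edge s1-s2.
    edges[frozenset((s1, s2))] = 0

    # Accumulation of the error (step 5): squared distance of the winner.
    errors[s1] += dist[s1] ** 2

    # Adaptation (step 6): move the winner and its neighbours towards x.
    neighbours = [next(iter(e - {s1})) for e in edges if s1 in e]
    weights[s1] = [w + eps_w * (xi - w) for w, xi in zip(weights[s1], x)]
    for n in neighbours:
        weights[n] = [w + eps_n * (xi - w) for w, xi in zip(weights[n], x)]

    # Elimination of connections (step 8): drop edges older than a_max.
    for e in [e for e, age in edges.items() if age > a_max]:
        del edges[e]
    return s1, s2
```

Only the winner search at the top touches every neuron, which is consistent with that stage dominating the run time for large networks.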
Estimating the upper bound of acceleration factor
After presenting the different stages of the algorithm, and before tackling their parallel implementation, it is necessary to know what percentage of the total number of instructions is executed at each step. To this end, we use a profiler to obtain the percentage of instructions executed at each stage, which depends on the values of the parameters with which the algorithm has been adjusted (number of neurons and number of input patterns).
It can be seen that most of the execution time of the algorithm is consumed in the winning neurons search stage, which also calculates the Euclidean distances. Once this information has been obtained we apply different metrics of parallel computing to estimate which would be the overall maximum acceleration that we can obtain assuming that we can accelerate these stages by a factor S. The metrics used are widely known: Amdahl's Law [48] and other performance metrics of parallel computing [49, 50, 51] .
In particular, we focus our study on the modern version of Amdahl's Law, which states that if a fraction f is accelerated by a factor S, the overall acceleration is:

$$\text{Speedup}(f, S) = \frac{1}{(1 - f) + \frac{f}{S}}$$
This equation gives a better estimate of the theoretical maximum acceleration obtainable with parallel implementations on GPUs, as it applies the achieved improvement to a fraction of the code instead of to the total number of cores. The number of cores can be used to measure the acceleration with respect to execution on a single GPU core, but in our case, as in most cases, the acceleration is measured with respect to execution on one CPU core. So S is defined as the speed-up obtained with respect to a fraction of the CPU code.
As shown in table 2, by applying Amdahl's law we can estimate the maximum acceleration we could get in the algorithm after accelerating a fraction of it by a factor S. Other implicit latencies exist in the architecture; they will be discussed in the following sections. Accelerating the winning neuron search and Euclidean distance stages offers the highest overall acceleration.
In the experiments section, real values for speed-up of winning neuron search stage will be obtained. Then we will apply Amdahl's law again to compare theoretical values with real overall speed-up obtained using the GNG algorithm.
Thereby we will be able to measure how much time is consumed by other latencies like data transfers or device initialization and what is the speed-up upper bound for GNG algorithm.
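As a small illustration of this estimate, the sketch below evaluates Amdahl's law, 1 / ((1 - f) + f / S), together with its limit for an unbounded S. The function names and the sample values f = 0.9 and S = 40 are assumptions chosen only for illustration; they are not the profiled values of this work.

```python
def overall_speedup(f, s):
    """Amdahl's law: overall speed-up when a fraction f is accelerated by s."""
    if not 0.0 <= f <= 1.0 or s <= 0:
        raise ValueError("f must be in [0, 1] and s must be positive")
    return 1.0 / ((1.0 - f) + f / s)

def upper_bound(f):
    """Limit of the overall speed-up as s grows without bound."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)

# Hypothetical values, chosen only for illustration.
print(round(overall_speedup(0.9, 40.0), 2))
print(round(upper_bound(0.9), 2))
```

With these sample values the overall speed-up is roughly 8x even though the accelerated fraction runs 40x faster, and it can never exceed 10x: the part that is not accelerated quickly dominates.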
GPU Implementation
In order to accelerate the GNG algorithm on GPUs using CUDA, it is necessary to redesign it so that it fits within the GPU architecture. Many of the operations performed in the GNG algorithm can be parallelized because they act on all the neurons of the network simultaneously. This is possible because there is no direct dependence between neurons at the operational level. However, there is a dependence in the adjustment of the network, which makes it necessary to synchronize the various parallel operations at each iteration. Figure 2 describes the GNG algorithm steps that have been accelerated on the GPU using kernels.
Euclidean distance calculation
The first stage of the algorithm that has been accelerated is the calculation of Euclidean distances performed at each iteration. This stage calculates the Euclidean distance between a random pattern and each of the neurons. This task may take place in parallel by running the calculation of each neuron distance onto as many threads as neurons the network contains. It is possible to calculate more than one distance per thread, but this is efficient only for large vectors where the number of blocks executed on the GPU is also very high.
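The independence between neurons in this stage can be seen in a sequential sketch. The code below is an illustrative CPU version, not the CUDA kernel of this work, and the function name `distances_to_pattern` is an assumption. Each element of the result depends only on one neuron and the pattern, which is what makes a one-thread-per-neuron mapping possible.

```python
import math

def distances_to_pattern(neurons, pattern):
    """Return the Euclidean distance from pattern to every neuron.

    Each list element is computed independently of the others, which is
    what allows one GPU thread per neuron in a parallel version.
    """
    return [math.sqrt(sum((w - p) ** 2 for w, p in zip(neuron, pattern)))
            for neuron in neurons]
```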
Parallel reduction
The second task parallelized was the search for the winning neuron, the one with the lowest Euclidean distance to the generated pattern, and for the second closest neuron.
For this search, we use a parallel reduction technique described in [52]. This technique accelerates parallel operations such as the search for the minimum in large data sets. For our work, we modified the original algorithm so that a single reduction yields not only the minimum but the two smallest values of the entire data set. This new version has been called 2MinParallelReduction. Figure 3 shows how Parallel Reduction can be described as a binary tree where, at the end of the $\log_2(N)$ steps, we obtain the final result of the operation on a set of N elements.
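The idea of obtaining the two smallest values in one reduction can be simulated sequentially. The sketch below is an assumption about how such a reduction may be organised, not the kernel used in this work: every slot carries a (best, second) pair, and pairs are merged with a halving stride (sequential addressing). The names `merge` and `two_min_reduction` are hypothetical.

```python
INF = (float("inf"), -1)  # sentinel (distance, index) used for padding

def merge(a, b):
    """Combine two (best, second) pairs into the two smallest entries."""
    best, second, *_ = sorted(a + b)
    return (best, second)

def two_min_reduction(distances):
    """Return the indices of the smallest and second smallest distances."""
    # Pad to a power of two so that every step halves the active range.
    n = 1
    while n < len(distances):
        n *= 2
    slots = [((d, i), INF) for i, d in enumerate(distances)]
    slots += [(INF, INF)] * (n - len(distances))
    stride = n // 2
    while stride > 0:
        # On a GPU, every i in this loop would be one thread, and a barrier
        # would separate consecutive strides.
        for i in range(stride):
            slots[i] = merge(slots[i], slots[i + stride])
        stride //= 2
    (_, first), (_, second) = slots[0]
    return first, second
```

For example, `two_min_reduction([5.0, 1.0, 3.0, 0.5, 4.0])` returns the pair of indices `(3, 1)`. Carrying a pair per slot costs a few extra comparisons per merge but avoids a second pass over the data.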
Complexity
To compare the complexity of this approach with the sequential version, which has a complexity of $O(N)$, it should be noted that in parallel processing we identify three types of complexity: complexity in the number of execution steps, complexity of the work performed, and time complexity. For the parallel reduction algorithm these are:
• The execution steps complexity is $O(\log_2 N)$, since it is necessary to perform $\log_2(N)$ iterations to reach the final result. Within each execution step s, $N/2^s$ operations are performed.
• The complexity of the work performed is $\sum_{s=1}^{S} N/2^s = O(N)$, where S is the total number of steps.
• The time complexity is $O(N/P + \log_2 N)$, where P is the number of processors.
[Figure 3: Parallel Reduction with sequential addressing, steps 1 to 4.]
Therefore, since each block executes t threads and each thread processes one element of the set, the number of threads equals the number of elements N. Treating each of these threads as a processor ($P = N$), the time complexity reduces to $O(\log_2 N)$, compared with $O(N)$ for the sequential version.
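As a worked instance of these expressions, take N = 1024 elements and one thread per element (values chosen only for illustration):

```latex
S = \log_2 1024 = 10, \qquad
\sum_{s=1}^{10} \frac{1024}{2^{s}} = 512 + 256 + \dots + 1 = 1023 = N - 1 .
```

The reduction therefore finishes in 10 parallel steps, whereas a sequential scan for the minimum needs 1023 comparisons one after another; the total work stays $O(N)$ in both cases.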
Despite this difference in complexity between the parallel and the sequential versions, preparing and executing programs on the GPU involves a time penalty, and transferring the data to be processed to GPU memory adds a further penalty. These penalties only begin to be compensated from a certain number X of elements onwards. This issue also affects the cost of the operation to be performed on the data.
Other optimizations
To speed up the remaining steps we have followed the same strategy used during the first phase. Each thread is responsible for performing one operation on a neuron: checking the age of its edges and deleting those that exceed a certain threshold, updating the local error of the neuron, or adjusting the neuron weights. At the stage of finding the neuron with maximum error, the strategy is the same as the one used to find the winning neuron, but in this case the reduction looks only for the neuron with the highest error.
Regardless of the parallelism of the algorithm, we have followed some good practices on the CUDA architecture to get more performance. First, constant memory is used to store the neural network parameters $\epsilon_w$, $\epsilon_n$, α, γ and $a_{max}$. Access to parameters stored in this memory is faster than access to values stored in global memory.
Our GNG implementation on the GPU architecture is also limited by the available memory bandwidth. In the experiments section we show specification reports for each CUDA capable device used, including its memory bandwidth. This bandwidth is only achievable under highly idealized memory access patterns, but it does provide an upper limit of memory performance. Nevertheless, some memory access patterns, like moving data from global memory into shared memory and registers, provide better coalesced access. The shared memory within each multiprocessor has been used to take full advantage of the memory bandwidth: it acts as a cache that avoids frequent accesses to global memory in operations with neurons and allows the threads to achieve coalesced reads when accessing neuron data.
For instance, a GNG network composed of 20,000 neurons and auxiliary structures requires only 17 megabytes. Therefore, GPU implementation in terms of size does not present problems as current GPU devices have enough memory to store it.
Minimizing transfers approach
Memory transfers between CPU and GPU are the main bottleneck to obtaining speed-up, so these transfers have been avoided as much as possible. Initial versions of the algorithm failed to outperform the CPU version because the complete neural network was copied from GPU memory to CPU memory and vice versa for each input pattern generated. This penalty, caused by the bottleneck of the transfer through the PCI-Express bus, was so high that we did not improve on the CPU version. After careful consideration of the flow of execution, we decided to move the inner loop of pattern generation to the GPU, although some tasks are not parallelizable and have to be run on a single GPU thread.
The workflow of our first approach using CUDA is shown in figure 4. First, the GNG network is created on the CPU and the CUDA device is initialized. Then, the necessary space is allocated in GPU memory to perform the processing. Once the GNG network structure has been copied to GPU memory, the learning algorithm begins: first, a random input pattern is generated. Second, the Euclidean distances from the pattern to each of the neurons are calculated in parallel, taking advantage of the massively parallel computing on the GPU, and the two neurons with the lowest distance (winning neurons) are obtained using a parallel reduction. Then, the indexes of the winning neurons are copied to CPU memory and the adjustment is performed sequentially. This step is repeated λ times.
This first approach did not improve on the CPU version because the entire neural network was copied from CPU memory to the GPU λ times at each new neuron insertion, which introduced significant latencies. Figure 5 shows the approach used to avoid this large number of transfers between GPU and CPU memory. The inner loop has been moved to the GPU, so it is not necessary to copy the network structure back to CPU memory to make the adjustment. In this case, the adjustment is performed in a single thread on the GPU because the task is sequential and cannot be parallelized. Performing this task in a single GPU thread is preferable because of the high latency of transfers between GPU and CPU. It is also clear that reducing the number of memory transactions from the device memory results in a significant increase of the processing throughput. Figure 6 shows that, for 500 patterns, the percentage of execution time spent on memory transfers between CPU and GPU is drastically reduced. Thus we can increase the number of input patterns without increasing the number of transfers between memories.
The use of CUDA in this algorithm provides better performance for a large number of neurons because of the time needed for architecture-specific setup such as kernel launches or GPU memory allocation. Performing these operations on small vectors of 50-500 neurons is almost immediate on the CPU, while the GPU cannot hide the latencies inherent in its architecture unless a large number of neurons is reached. Therefore, we considered applying a hybrid technique that exploits the fact that the GNG is an incremental network that initially works with a small number of neurons, which grows progressively. This hybrid technique begins by running the GNG on the CPU; when it is detected that the runtime of the sequential version is higher than the runtime of the parallelized one, the network is copied to GPU memory and the remaining calculation is performed on the GPU.
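The switching logic of such a hybrid scheme can be sketched as follows. This is a simplified sketch under stated assumptions: the switch is triggered by a fixed neuron-count threshold rather than by measured run times, the network grows by one neuron per outer iteration, and `step_cpu` and `step_gpu` are hypothetical caller-supplied callables that run one learning cycle on the corresponding device.

```python
def choose_backend(num_neurons, gpu_threshold):
    """Return 'cpu' below the threshold and 'gpu' from the threshold upwards."""
    return "gpu" if num_neurons >= gpu_threshold else "cpu"

def train_hybrid(initial_neurons, max_neurons, gpu_threshold, step_cpu, step_gpu):
    """Grow a network one neuron per outer iteration, switching device once.

    step_cpu and step_gpu are hypothetical callables that run one learning
    cycle for a network of the given size.
    """
    schedule = []
    on_gpu = False
    for n in range(initial_neurons, max_neurons):
        if not on_gpu and choose_backend(n, gpu_threshold) == "gpu":
            on_gpu = True  # the network would be copied to GPU memory here, once
        (step_gpu if on_gpu else step_cpu)(n)
        schedule.append("gpu" if on_gpu else "cpu")
    return schedule
```

The single, one-way switch matters: the network is transferred to the device only once, so the hybrid scheme does not reintroduce the repeated CPU-GPU copies that the design tries to avoid.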
Experiments
The accelerated version of the GNG algorithm has been developed and tested on a machine with an Intel Core i3 540 at 3.07 GHz and different CUDA capable devices. Table 3 shows the different models that we have used and their features.
The multi-core CPU implementation of the GNG algorithm has been developed using Intel Threading Building Blocks (TBB) library [53] , taking advantage of the multi-core processor capabilities and avoiding the existing overhead [54] .
The number of threads used in the multi-core CPU implementation is the maximum defined in the specifications of Intel i3 540 processor. 
Number of threads per block
As mentioned in section 2, threads are organized into blocks to carry out their execution on multiprocessors. Depending on the application developed, a different number of threads per block should be used to obtain the best performance. We tested different kernels running on the NVIDIA GTX 480 with different numbers of threads per block. The best performance is obtained when using between 128 and 256 threads per block (figure 7). This is because, with these parameters, we obtain maximum occupancy of the CUDA multiprocessors.
These results are directly applied to the other cards since the number of threads per block depends on the kind of application designed. 
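To make the relation between network size and launch parameters concrete, the sketch below computes how many blocks a one-thread-per-neuron launch needs and how many of those blocks fit on one multiprocessor. It is a deliberately simplified model: the function name `launch_config` is an assumption, the default of 1024 threads per SM is the figure quoted in section 2, and register and shared-memory usage, which also bound occupancy, are ignored.

```python
def launch_config(num_neurons, threads_per_block=256, max_threads_per_sm=1024):
    """Return (blocks in the grid, blocks resident per SM), one thread per neuron.

    Simplified: only the thread limit per SM is considered; register and
    shared-memory usage, which also bound occupancy, are ignored.
    """
    # Ceiling division so that the last, partially filled block is counted.
    blocks = (num_neurons + threads_per_block - 1) // threads_per_block
    resident = max_threads_per_sm // threads_per_block
    return blocks, resident
```

Under these assumptions, a network of 20000 neurons with 256 threads per block needs 79 blocks, and 4 blocks fill the thread capacity of one multiprocessor.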
Speed-up 2Min Parallel Reduction
We have carried out experiments on the 2Min Parallel Reduction implementation with different graphics boards, using a configuration of 256 threads per block for kernel launches. We obtained a speed-up factor of up to 43x with respect to a single-core CPU and 40x with respect to a multi-core CPU in the task of adjusting a network with a number of neurons close to 100k. As we can see in figure 8 (bottom), the speed-up factor depends on the device on which we execute the algorithm and the number of cores it has. If we apply these values to Amdahl's law, which we previously analyzed, replacing the speed-up factor of fraction f by these values, we can estimate the upper limit of the speed-up obtainable for the whole GNG algorithm.
Thereby, the current acceleration will be compared with the calculated upper limit and we can extract what percentage of time is consumed by other latencies implied in the CUDA architecture. 
2D representation and 3D reconstruction
To test our parallel version of the GNG algorithm, we have carried out experiments using the GNG for 2D representation and 3D reconstruction. In 3D reconstruction, the number of neurons necessary to adapt to the input space is high, which favours the use of the GPU. Therefore, an increase in speed over the CPU version can be achieved.
Based on a previous work [55], we have chosen a number of neurons N of 1000 / 5000 / 10000 / 200000 and a number of input patterns λ of 500/1000. Other parameters have also been fixed based on our previous experience: $\epsilon_w = 0.1$, $\epsilon_n = 0.001$, α = 0.5, γ = 0.95, $a_{max} = 250$.
GNG learning speed-up factor
Figure 9 shows an experiment using GNG to reconstruct a 3D object with 20000 neurons and 1000 input patterns, where the CPU solution takes more and more time as the number of neurons in the network grows. In contrast, the parallel CUDA version increases the size of the array of neurons without significantly degrading performance. For 20k neurons, we obtain a 6x speed-up factor using an NVIDIA GTX 480 GPU. This speed-up is lower than the theoretical overall speed-up that we estimated in the previous section. This is due to the implicit latencies of the GPU architecture. Table 5 shows the differences between the theoretical overall speed-up and the obtained overall speed-up.
Table 5: Theoretical overall speed-up and obtained overall speed-up using the device GTX480.
In figure 9 , it can also be appreciated that the CPU version is faster during the first iterations, so a hybrid version would be faster than separate CPU and GPU versions. Multi-core CPU implementation is also slower during the first iterations compared with single-core CPU due to the existing overhead caused by the management of threads and by the subdivision of the problem. 
GNG hybrid version
As we discussed in the previous experiments, the GPU version has low performance in the first iterations of the learning algorithm, where the GPU cannot hide the latencies because of the small number of processing elements. To achieve an even bigger acceleration of the GNG algorithm, we propose using the CPU in the first iterations of the algorithm and starting to process data on the GPU only when it provides an acceleration over the CPU, thus achieving a bigger overall acceleration of the algorithm (see figure 10). To determine the number of neurons at which to start computing on the GPU, we have analyzed in detail the execution times for each new insertion and concluded that each device, depending on its computing power, starts being efficient at a different number of neurons.
After several tests, we have determined the threshold at which each device starts accelerating regarding the CPU version. As it can be seen in figure 9 (top), threshold values for different devices are set to 1500, 1700, 2100 for GTX 480, Tesla 
Rate of adjustments per second
We have performed several experiments that show that the accelerated version of the GNG is not only capable of learning faster than the CPU, but also obtains more adjustments per second than the single-threaded and multi-threaded CPU implementations. For instance, after learning a network of 20000 neurons we can perform 17 adjustments per second using the GPU, while the single-core CPU gets 2.8 adjustments per second and the multi-core CPU gets 8 adjustments per second.
This means that the GPU implementation can obtain a good topological representation under time constraints. Figure 11 shows the rates of adjustments per second achieved by different GPU devices compared to the CPU. It also shows that the CPU cannot sustain a high rate of adjustments per second as the number of neurons increases.
Discussion
From the experiments described above we can conclude that the number of threads per block that best fits our implementation is 256, for the following reasons: first, the amount of computation the algorithm performs in parallel; second, the amount of resources that each device has; and finally, the use that we have made of shared memory and registers. It is also demonstrated that, in comparison to the CPU implementation, 2MinParallelReduction achieves a speed-up of more than 40x in finding the neuron at minimum distance to the generated input pattern.
Experiments on the complete GNG algorithm showed that, when using the GPU, small networks under-utilize the device, since only one or a few multiprocessors are used. Our implementation performs better for large networks than for small ones. To get better results for small networks we propose a hybrid implementation. These results show that GNG learning with the proposed hybrid implementation achieves a speed-up of 6x over the single-threaded CPU implementation.
Finally, it is shown how our GPU implementation can process up to 17 adjustments of the network per second while single threaded CPU implementation only can manage 2.8, getting a speed-up factor of more than 6 times.
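The speed-up factors implied by these rates follow directly from the reported adjustments per second (17 for the GPU, 2.8 for the single-core CPU and 8 for the multi-core CPU):

```latex
\frac{17}{2.8} \approx 6.07 \quad \text{(GPU vs. single-core CPU)}, \qquad
\frac{17}{8} \approx 2.13 \quad \text{(GPU vs. multi-core CPU)} .
```

The first ratio matches the speed-up factor of more than 6 stated above; the second is derived here from the same reported rates and is not a separately measured result.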
Accelerating 6DoF egomotion using GNG
In this section, we show an application whose solution is improved by the accelerated GNG. The main goal of this application is to perform six degrees of freedom (6DoF) pose registration in semi-structured environments, i.e., man-made indoor and outdoor environments. This registration can provide a good starting point for Simultaneous Localization and Mapping (SLAM). We use the method proposed in [31]. This method was developed for managing 3D point sets collected by any kind of sensor. For our experiments, we have used data from an SR4000 infrared time-of-flight camera, but [31] contains examples of this method applied to other 3D devices, like a sweeping unit with a Sick 2D laser and a Digiclops stereo camera, mounted on a mobile robot. We are also interested in dealing with outliers, i.e., environments with people or non-modeled objects. This task is hard because classic algorithms, like ICP and its variants, are very sensitive to outliers. Furthermore, we do not use odometry information. Finally, the huge amount of data makes it necessary to accelerate the overall process in order to obtain the results in real time.
We briefly describe the method proposed in [31] to manage 3D data and to use it for 6DoF egomotion calculation. GNG produces a Delaunay triangulation which can be used as a representation of the neighbourhood of the points. GNG can be applied directly to 3D data. Figure 12 shows the result of applying GNG to 3D points from an SR4000. On the other hand, in [31] a feature extraction process is applied to the raw 3D data in order to obtain a complexity reduction. These features are planar patches, which are models representing surfaces from the 3D data. This feature extraction method is based on neighbour searching. We can improve and accelerate the neighbour searching using the GNG structure, as it produces more detailed and accurate planar patch descriptions. Figure 13 shows planar patch extraction from a 3D image obtained by an SR4000 camera. The right image shows the results of combining GNG with the feature extraction procedure. It can be compared with the left image, in which no GNG has been used. The more planar patches we have, the more accurate the result we obtain. For this reason, we would like to use these models to achieve further mobile robot applications in real 3D environments. The basic idea is to take advantage of the extra knowledge that can be found in 3D models, such as surfaces and their orientations. This information is introduced in a modified version of an ICP-like algorithm in order to reduce the incidence of outliers in the results. ICP [56] is widely used for the geometric alignment of a pair of three-dimensional point sets.
From an initial approximate transformation, ICP iterates the next three steps until convergence is achieved: first, closest points between sets are stated; then, best fitting transformation is computed from paired points; finally, transformation is applied. In mobile robotics area, the initial transformation usually comes from odometry data.
Nevertheless, our approach does not need an initial approximate transformation as ICP-based methods do. We can use the global model structure to recover the correct transformation. This feature is useful in situations where no odometry is available or where it is not accurate enough, as with legged robots. In our case, we exploit both the information given by the normal vector of the planar patches and their geometric position. Whereas the original ICP computes both orientation and position at each iteration of the algorithm, we can take advantage of the knowledge about the orientation of the planar patches to decouple the computation of rotation and translation. We first register the orientation of the planar patch sets, and when the two sets are aligned we address the translation registration. In figure 14, we show an example of 3D map building using this 6DoF egomotion approach. For this experiment, 100 3D images from a 5 meter range SR4000 camera were used. The image on the left shows a 3D view of the reconstructed environment using 6DoF egomotion from planar patches. In the right image, the same scene is reconstructed but GNG was used to improve feature extraction.
While in the first experiment the registration of the sequence was almost impossible, in the second one the reconstruction was reasonably good. The computing time for obtaining planar patch descriptions after applying GNG is almost the same as without GNG, about 300 ms per image. GPU acceleration provides a lower reconstruction time per data acquisition: 50 ms for an adjustment of a neural network composed of 20,000 neurons with λ = 1000 input patterns, as can be seen in figure 11. This makes our system suitable for dealing with time constraints.
Conclusions and future work
This paper proposes the modification and acceleration of the GNG algorithm in order to obtain a more efficient version suitable for operations with time constraints. As demonstrated in the experiments, the runtime of the sequential GNG algorithm grows as the number of neurons in the network increases. In contrast, in the parallel version implemented on the GPU architecture, as we increase the number of neurons we obtain a greater acceleration over the sequential version. Experimental results show that the GPU implementation significantly reduces learning time compared with single-threaded and multi-threaded CPU implementations of GNG.
The GNG algorithm can be accelerated using the GPU, which provides better performance than the CPU implementations. It has also been demonstrated how 3D scene reconstruction for mobile robotics can be accelerated using GPUs in order to deal with time constraints.
The parallel solution implemented on the GPU can still be improved by carefully analyzing all the aspects offered by the CUDA architecture and making better use of them: multiprocessor occupancy, use of the memory hierarchy, transfers between CPU and GPU memory, and others.
Further work will include other improvements to the GPU implementation: generating random patterns on the GPU and using multi-GPU computation to improve performance and to manage several neural networks learning different features simultaneously. More applications of the accelerated GNG will be studied in the future.
