Write-out process.
<p>The thread id indicates (based on the computation __popc(B)) whether a given thread is writing out an element less than the pivot or an element greater than or equal to the pivot. Two values, maintained in shared memory and updated incrementally, record the positions in the global array of the last written element that is less than the pivot and of the last written element that is greater than or equal to the pivot. This operation involves at most two coalesced memory writes.</p>
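<p>A minimal single-warp CUDA sketch of this write-out step is given below. The counter names, the warp-leader reservation scheme, and the use of global-memory counters in place of the shared-memory values described above are assumptions made to keep the example self-contained; only the ballot-plus-__popc slot computation and the coalesced writes follow the description.</p>
```cuda
// Hedged sketch: each warp ballots on (value < pivot); __popc of the masked
// ballot gives every thread its slot, so each side of the partition is
// written with one coalesced store per warp. Launch with a multiple of
// 32 threads per block.
__global__ void write_out(const float* in, float* out, float pivot, int n,
                          int* lessCount,        // running count of "< pivot" elements written
                          int* geqCount,         // running count of ">= pivot" elements written
                          int geqRegionStart)    // where the ">= pivot" region begins in 'out'
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (i < n);
    float v = active ? in[i] : 0.0f;
    bool less = active && (v < pivot);
    bool geq  = active && !less;

    unsigned B = __ballot_sync(0xffffffffu, less);   // lanes holding "< pivot" values
    unsigned G = __ballot_sync(0xffffffffu, geq);    // lanes holding ">= pivot" values
    unsigned lane = threadIdx.x & 31u;
    unsigned lowerLanes = (1u << lane) - 1u;

    if (less) {
        int base = 0;
        if (lane == (unsigned)(__ffs(B) - 1))        // first "< pivot" lane reserves a range
            base = atomicAdd(lessCount, __popc(B));
        base = __shfl_sync(B, base, __ffs(B) - 1);   // broadcast the range start to the "< pivot" lanes
        out[base + __popc(B & lowerLanes)] = v;      // coalesced within the warp
    } else if (geq) {
        int base = 0;
        if (lane == (unsigned)(__ffs(G) - 1))
            base = atomicAdd(geqCount, __popc(G));
        base = __shfl_sync(G, base, __ffs(G) - 1);
        out[geqRegionStart + base + __popc(G & lowerLanes)] = v;
    }
}
```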
Data partitioning for distributed parallel execution.
<p>The squared distance matrix is split into partitions, where the number of partitions is an integral multiple of the number of nodes. The computation of the partitions and the subsequent computation of the k-NNs are distributed among the different nodes.</p>
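<p>A host-side sketch of this partitioning is shown below (C++/CUDA host code). The row-block layout, the names, and the round-robin node assignment are assumptions; the description above only requires the partition count to be an integral multiple of the node count.</p>
```cuda
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// A partition is assumed here to be a block of consecutive rows of the
// squared distance matrix; the layout is not fixed by the description above.
struct Partition { int rowBegin, rowEnd; };

std::vector<Partition> makePartitions(int numRows, int numNodes, int partsPerNode)
{
    int numPartitions = numNodes * partsPerNode;   // integral multiple of the node count
    int rowsPerPart = (numRows + numPartitions - 1) / numPartitions;
    std::vector<Partition> parts;
    for (int p = 0; p < numPartitions; ++p) {
        int b = p * rowsPerPart;
        int e = std::min(numRows, b + rowsPerPart);
        if (b < e) parts.push_back({b, e});
    }
    return parts;
}

int main()
{
    const int numNodes = 4;                        // assumed cluster size
    std::vector<Partition> parts = makePartitions(100000, numNodes, 8);
    for (std::size_t p = 0; p < parts.size(); ++p) // distribute partitions across the nodes
        std::printf("rows [%d,%d) -> node %zu\n",
                    parts[p].rowBegin, parts[p].rowEnd, p % (std::size_t)numNodes);
}
```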
Finding vector norms.
<p>Each thread block is assigned to compute the norm of one vector of the input data set. Each thread strides through the vector with a stride equal to the number of threads in the thread block and computes a partial sum. Finally, an atomic add operation is used to accumulate the per-thread partial sums into a single location in global memory on the GPU.</p>
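<p>A CUDA sketch of this kernel follows. The row-major layout and the requirement that the output array be zero-initialised beforehand are assumptions; the block-per-vector assignment, the strided loop, and the atomic add match the description above.</p>
```cuda
// One thread block per vector: each thread strides over the components with
// stride blockDim.x, and the per-thread partial sums are combined with an
// atomic add into a single global-memory location per vector.
__global__ void squared_norms(const float* __restrict__ vectors, // numVectors x d, row-major (assumed)
                              float* __restrict__ norms,          // numVectors entries, zero-initialised
                              int d)
{
    const float* v = vectors + (size_t)blockIdx.x * d;  // the vector assigned to this block
    float partial = 0.0f;
    for (int j = threadIdx.x; j < d; j += blockDim.x)   // strided loop over the components
        partial += v[j] * v[j];
    atomicAdd(&norms[blockIdx.x], partial);             // accumulate the per-thread partial sums
}
```
<p>A launch with a grid size equal to the number of vectors and, say, 256 threads per block would assign one block per vector; the block size is a tuning choice, not a value taken from the paper.</p>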
Processing local k-NNs within nodes.
<p>The sub-problem assigned to a node is finding the row and column k-NNs with respect to the node's portion of the distance matrix, which is further divided into partitions; all of these partitions are processed by the node's GPU. The row k-NNs are processed within GPU memory and the merged results are written to CPU RAM. The column k-NNs are written to CPU RAM. Later, the local column k-NNs are merged by a single GPU.</p>
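<p>The rough host-side skeleton below illustrates this data flow. Every function it calls is a hypothetical stub standing in for the corresponding step; none of these names come from the implementation described above.</p>
```cuda
#include <cstdio>

// Hypothetical stubs: in a real implementation these would launch the
// distance, selection, merge, and copy steps.
static void computeTileKnn(int part, int k)      { std::printf("partition %d: k-NN (k=%d)\n", part, k); }
static void mergeRowKnnOnGpu(int part, int k)    { (void)part; (void)k; }  // row k-NNs stay in GPU memory
static void copyColumnKnnToHost(int part, int k) { (void)part; (void)k; }  // column k-NNs staged in CPU RAM
static void copyMergedRowKnnToHost(int k)        { (void)k; }
static void mergeColumnKnnOnSingleGpu(int k)     { (void)k; }              // later, on a single GPU

// Per-node pipeline: process every local partition on the node's GPU, keep
// the row k-NN merge on the GPU, and stage column k-NNs in CPU RAM for a
// later merge on one GPU.
void processLocalKnn(int numLocalPartitions, int k)
{
    for (int t = 0; t < numLocalPartitions; ++t) {
        computeTileKnn(t, k);
        mergeRowKnnOnGpu(t, k);
        copyColumnKnnToHost(t, k);
    }
    copyMergedRowKnnToHost(k);
    mergeColumnKnnOnSingleGpu(k);
}

int main() { processLocalKnn(8, 32); }
```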
Benchmark against work by Alabi et al. [33].
<p>Here the benchmark compares the speedup of selecting the k-th element against selection of the k-NNs using our algorithm. These tests were performed on element values alone. We kept the product of the two parameters constant while varying them simultaneously over a range of values. There is a dramatic fall in performance gain because the method by Alabi et al. increases GPU utilization as the problem size increases; it reaches saturation <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0092409#pone.0092409-Alabi1" target="_blank">[33]</a>. Our algorithm starts underutilizing GPU resources beyond a certain problem size. Even when the method of Alabi et al. is saturated and our algorithm is significantly underutilizing GPU resources, ours is faster. Once again, with a larger GPU RAM, our algorithm would perform significantly better.</p>
Performance benchmarks for varying dimension.
<p>In this test the number of input objects/vectors is fixed and the dimension of the input data is varied. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0074113#pone-0074113-g007" target="_blank">Figure 7(a)</a> shows the performance relative to <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0074113#pone.0074113-Garcia1" target="_blank">[23]</a>. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0074113#pone-0074113-g007" target="_blank">Figure 7(b)</a> shows the performance relative to <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0074113#pone.0074113-Arefin1" target="_blank">[24]</a>. As the dimension increases, the total performance gain asymptotically approaches that of matrix multiplication because, for large dimensions, this computation dominates.</p>
Node assignments for processing partitions of the distance matrix.
<p>Due to the symmetry of the squared distance matrix, partitions below the diagonal are transposes of partitions above it; therefore only the partitions on or above the diagonal have to be computed. Each such partition is processed by a node determined by a fixed mapping of partition indices to nodes.</p>
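<p>The snippet below sketches one possible mapping under the symmetry argument above: only partitions on or above the diagonal are enumerated, and they are handed out round-robin. The round-robin rule and the counts are assumptions; the exact mapping used in the paper is not reproduced here.</p>
```cuda
#include <cstdio>

int main()
{
    const int tilesPerDim = 6;   // partitions per matrix dimension (assumed)
    const int numNodes    = 3;   // nodes in the cluster (assumed)
    int next = 0;
    for (int i = 0; i < tilesPerDim; ++i)
        for (int j = i; j < tilesPerDim; ++j)   // upper triangle only, by symmetry
            std::printf("partition (%d,%d) -> node %d\n", i, j, next++ % numNodes);
}
```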
Summation kernel.
<p>The calculation of every row of the squared distance matrix involves the full vector of column norms and one element of the row-norm vector per row. Therefore, each thread loads one element of the column-norm vector into a register; these data are reused to compute all rows of the partition. Next, one thread per block reads the corresponding element of the row-norm vector into shared memory. Finally, each thread reads an element of the product matrix and adds to it the column norm held in its register and the row norm held in shared memory to generate the corresponding element of the squared distance matrix.</p>
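<p>A hedged CUDA sketch of such a summation kernel is shown below. It assumes the product term has already been written into the output array (for example by a GEMM call), that each block owns a strip of columns and loops over the rows, and that the norms are split into a per-column vector (kept in registers) and a per-row vector (staged in shared memory); all names and the tiling are assumptions.</p>
```cuda
// Each block handles a strip of columns: the column norm lives in a register
// and is reused for every row, while one thread per block stages the current
// row norm in shared memory before the whole block adds both to the product
// term already stored in D.
__global__ void add_norms(float* __restrict__ D,               // numRows x numCols, holds the product term on entry
                          const float* __restrict__ rowNorms,  // one entry per row
                          const float* __restrict__ colNorms,  // one entry per column
                          int numRows, int numCols)
{
    __shared__ float rowNorm;

    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float colNorm = (col < numCols) ? colNorms[col] : 0.0f;    // register value, reused for all rows

    for (int row = 0; row < numRows; ++row) {
        if (threadIdx.x == 0) rowNorm = rowNorms[row];         // one thread per block stages the row norm
        __syncthreads();
        if (col < numCols)
            D[(size_t)row * numCols + col] += rowNorm + colNorm;
        __syncthreads();                                       // keep rowNorm stable until everyone has used it
    }
}
```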
Performance benchmarks for multi-GPU execution.
<p>In this test we used 2 GPUs. In <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0074113#pone.0074113-Arefin1" target="_blank">[24]</a>, the 2 GPUs (Tesla 2050) were mounted on a single desktop machine. For our implementation, we used 2 nodes of our GPU cluster and opted to use only one GPU per node. The dimension of the input data and the number of closest neighbors were fixed for this test.</p>
Pivot process.
<p>The pivot process is accomplished in shared memory. Each thread determines where in shared memory its value has to be written. Values less than the pivot are accumulated on the left-hand side and values greater than or equal to the pivot are accumulated on the right-hand side. Since all threads write to different locations, there are no bank conflicts.</p>
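<p>A minimal single-warp sketch of this partition step follows. The 32-element tile, the ballot-based index computation, and the names are assumptions introduced to keep the example self-contained; the packing of smaller values to the left, larger-or-equal values to the right, and the one-distinct-slot-per-thread writes follow the description above.</p>
```cuda
// Single-warp sketch (launch with 32 threads): a warp ballot tells each lane
// where its value belongs, "< pivot" values are packed from the left and
// ">= pivot" values from the right of the shared-memory tile.
__global__ void pivot_in_shared(float* data, float pivot)
{
    __shared__ float tile[32];
    unsigned lane = threadIdx.x & 31u;

    tile[lane] = data[lane];                      // stage the values in shared memory
    float v = tile[lane];
    bool less = (v < pivot);

    unsigned B = __ballot_sync(0xffffffffu, less);
    unsigned lowerLanes = (1u << lane) - 1u;

    // "< pivot": rank among the set bits of B, counted from the left end.
    // ">= pivot": rank among the clear bits, counted from the right end.
    int dst = less ? __popc(B & lowerLanes)
                   : 31 - __popc(~B & lowerLanes);

    __syncwarp();                                 // every lane has read its value
    tile[dst] = v;                                // distinct slots: no shared-memory bank conflicts
    __syncwarp();

    data[lane] = tile[lane];                      // write the partitioned tile back out
}
```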