
    Write-out process.

    The thread id indicates (based on the computation __popc(B)) whether a given thread is writing out an element less than the pivot or greater than or equal to the pivot. Two offsets, maintained in shared memory and updated incrementally, record the locations in the global array of the last element written that is less than the pivot and of the last that is greater than or equal to the pivot. This operation involves at most two coalesced memory writes.
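
    A minimal CUDA sketch of this write-out step for one full warp (the function and variable names are illustrative, not from the paper): a warp-wide ballot produces the bitmask B, __popc ranks each lane among the lanes on its side of the pivot, and one lane per warp reserves space from the shared-memory offsets so that each side lands in consecutive global addresses.

        __device__ void writeOut(float v, float pivot,
                                 float* g_out,       // global output array
                                 int* s_lessNext,    // shared: next free slot from the left
                                 int* s_geqNext)     // shared: next free slot from the right
        {
            unsigned lane = threadIdx.x & 31u;
            unsigned mask = (1u << lane) - 1u;                 // lanes before this one
            unsigned B    = __ballot_sync(0xffffffffu, v < pivot);
            int nLess     = __popc(B);                         // "less" lanes in this warp

            int lessBase = 0, geqBase = 0;
            if (lane == 0) {                                   // reserve space for the warp
                lessBase = atomicAdd(s_lessNext, nLess);
                geqBase  = atomicAdd(s_geqNext, -(32 - nLess));
            }
            lessBase = __shfl_sync(0xffffffffu, lessBase, 0);  // broadcast lane 0's bases
            geqBase  = __shfl_sync(0xffffffffu, geqBase, 0);

            if (v < pivot)
                g_out[lessBase + __popc(B & mask)] = v;        // grows from the left
            else
                g_out[geqBase - __popc(~B & mask)] = v;        // grows from the right
        }

    Because each side of the pivot occupies one contiguous run of addresses per warp, the warp issues at most two coalesced transactions, matching the caption's claim.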

    Data partitioning for distributed parallel execution.

    The squared distance matrix is split into partitions, the number of which is an integral multiple of the number of nodes. The computation of the partitions and the subsequent computation of the k-NNs are distributed among the nodes.
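
    A host-side sketch of one plausible way to realize this split (the names and the round-robin policy are assumptions): cut the matrix into a number of row blocks that is an integral multiple of the node count, then deal the blocks out to the nodes.

        #include <algorithm>
        #include <vector>

        struct Partition { int rowBegin, rowEnd; };  // half-open row range of the matrix

        std::vector<std::vector<Partition>> assignPartitions(int n, int numNodes, int multiple)
        {
            int numParts = numNodes * multiple;            // integral multiple of node count
            int rowsPer  = (n + numParts - 1) / numParts;  // ceiling division
            std::vector<std::vector<Partition>> byNode(numNodes);
            for (int p = 0; p < numParts; ++p)
                byNode[p % numNodes].push_back(
                    Partition{p * rowsPer, std::min(n, (p + 1) * rowsPer)});
            return byNode;
        }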

    Finding vector norms.

    Each thread block is assigned to compute the norm of one vector in the input set. Each thread strides through the vector with a stride equal to the number of threads per block, accumulating a partial sum of squared elements. Finally, an atomic add operation combines the per-thread partial sums into a single location in global memory on the GPU.
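
    A minimal CUDA sketch of this kernel (the kernel and buffer names are assumptions; squared norms are shown, since those are what the distance computation consumes, and the output buffer is assumed zero-initialized):

        __global__ void squaredNorms(const float* __restrict__ vectors, // n vectors of length d, row-major
                                     float* __restrict__ norms,         // n outputs, zero-initialized
                                     int d)
        {
            const float* v = vectors + (size_t)blockIdx.x * d;  // this block's vector
            float partial = 0.0f;
            for (int i = threadIdx.x; i < d; i += blockDim.x)   // stride = threads per block
                partial += v[i] * v[i];
            atomicAdd(&norms[blockIdx.x], partial);             // combine per-thread sums
        }

    A launch of one block per vector, e.g. squaredNorms<<<n, 256>>>(vectors, norms, d), matches the one-block-per-vector assignment described above.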

    Processing local k-NNs within nodes.

    The sub-problem assigned to a node is finding the row and column k-NNs with respect to its block of the distance matrix. The block is divided into partitions, each processed by a GPU. The row k-NNs are processed within GPU memory, and the merged results are written to CPU RAM. The column k-NNs are also written to CPU RAM; later, the local column k-NN lists are merged by a single GPU.
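
    A rough host-side outline of this flow (every function name below is hypothetical, with stub bodies; the paper's actual interfaces are not recoverable from the caption):

        #include <vector>

        struct Partition { int rowBegin, rowEnd; };

        void computeDistanceBlock(const Partition&) {}  // distance sub-matrix on the GPU
        void mergeRowKNNOnGPU(const Partition&) {}      // row k-NNs merged in GPU memory
        void stageColumnKNNToHost(const Partition&) {}  // column candidates to CPU RAM
        void copyMergedRowKNNToHost() {}
        void mergeStagedColumnKNNOnOneGPU() {}          // later, on a single GPU

        void processNodeSubproblem(const std::vector<Partition>& myPartitions)
        {
            for (const Partition& part : myPartitions) {
                computeDistanceBlock(part);
                mergeRowKNNOnGPU(part);
                stageColumnKNNToHost(part);
            }
            copyMergedRowKNNToHost();       // merged row results leave the GPU once
            mergeStagedColumnKNNOnOneGPU(); // column lists merged after all partitions
        }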

    Benchmark against work by Alabi et al. [33].

    Here the benchmark compares the speedup of selecting the k-th element against selection of the k-NN using our algorithm. These tests were performed on element values alone. We kept the product of the two test parameters constant and varied them simultaneously. There is a dramatic fall in performance gain because the method of Alabi et al. increases GPU utilization as the problem size increases, eventually reaching saturation [33]. Our algorithm starts underutilizing GPU resources at that scale. Even when the method of Alabi et al. is saturated and our algorithm is significantly underutilizing GPU resources, ours is faster. Once again, with a larger GPU RAM, our algorithm would perform significantly better.

    Performance benchmarks for varying dimension.

    In this test the number of input objects/vectors is held fixed while the dimension of the input data varies. Figure 7(a) shows the performance relative to [23]. Figure 7(b) shows the performance relative to [24]. As the dimension increases, the total performance gain asymptotically approaches that of matrix multiplication because, for large dimensions, this computation dominates.

    Node assignments for processing partitions of the distance matrix.

    Due to symmetry, mirror partitions of the distance matrix are equal, so only the partitions on one side of the diagonal have to be computed. Each such partition is processed by the node given by the assignment shown in the figure.
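
    The caption's exact assignment formula did not survive extraction, so the following is only one plausible balanced scheme (names are illustrative): fold each block index pair onto the upper triangle and deal the triangle out round-robin.

        int nodeForBlock(int i, int j, int numBlocks, int numNodes)
        {
            if (i > j) { int t = i; i = j; j = t; }      // symmetry: fold to i <= j
            // linear index of block (i, j) within the upper triangle, row-major
            int linear = i * numBlocks - i * (i - 1) / 2 + (j - i);
            return linear % numNodes;                    // round-robin over nodes
        }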

    Summation kernel.

    Calculation of every row of the squared distance matrix involves the full vector of squared norms plus one additional squared norm per row. Therefore, each thread loads one element of the norm vector into a register; these data are reused to compute all rows. Next, one thread per block reads the row's corresponding squared norm into shared memory. Finally, each thread reads an element of the inner-product matrix and adds to it the norm held in its register and the norm in shared memory, generating the corresponding element of the squared distance matrix.
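
    A hedged CUDA sketch of this kernel (the names and blocking scheme are assumptions): given G = -2 * X * X^T already in global memory, it forms D[row][col] = norms[row] + norms[col] + G[row][col]. Each thread keeps one column norm in a register and reuses it for every row its block walks, while one thread per block stages the row's norm in shared memory.

        __global__ void addNorms(float* __restrict__ G,           // in: -2*X*X^T, out: D
                                 const float* __restrict__ norms, // squared norms, length n
                                 int n, int rowsPerBlock)
        {
            __shared__ float rowNorm;
            int col = blockIdx.x * blockDim.x + threadIdx.x;
            bool active = (col < n);                         // keep all threads in the syncs
            float colNorm = active ? norms[col] : 0.0f;      // register copy, reused per row
            int rowBegin = blockIdx.y * rowsPerBlock;
            int rowEnd   = min(n, rowBegin + rowsPerBlock);
            for (int row = rowBegin; row < rowEnd; ++row) {
                if (threadIdx.x == 0) rowNorm = norms[row];  // stage the row's norm once
                __syncthreads();
                if (active)
                    G[(size_t)row * n + col] += colNorm + rowNorm;
                __syncthreads();                             // before rowNorm is overwritten
            }
        }

    One possible launch covering the whole matrix: dim3 grid((n + 255) / 256, (n + rowsPerBlock - 1) / rowsPerBlock); addNorms<<<grid, 256>>>(G, norms, n, rowsPerBlock);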

    Performance benchmarks for multi-GPU execution.

    In this test we used 2 GPUs. In [24], the 2 GPUs (Tesla 2050) were mounted on a single desktop machine. For our implementation, we used 2 nodes of our GPU cluster and opted to use only one GPU per node. The input data had a fixed dimension, and the number of closest neighbors was fixed as well.

    Pivot process.

    The pivot process is accomplished in shared memory. Each thread determines where in shared memory its value has to be written. Values less than the pivot are accumulated on the left-hand side, and values greater than or equal to the pivot are accumulated on the right-hand side. Since all threads write to different locations, there are no bank conflicts.
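
    A minimal sketch of this step for one warp and a 32-element shared buffer (names are assumptions): a ballot ranks each lane among the lanes on its side of the pivot, so every lane writes a unique slot, with "less" values packed leftward and the rest rightward. With 32 four-byte slots, each lane's write lands in a distinct bank, so no bank conflicts occur.

        __device__ void pivotPartition(float v, float pivot,
                                       float* s_buf)    // 32-slot shared-memory buffer
        {
            unsigned lane = threadIdx.x & 31u;
            unsigned mask = (1u << lane) - 1u;              // lanes before this one
            unsigned B    = __ballot_sync(0xffffffffu, v < pivot);
            if (v < pivot)
                s_buf[__popc(B & mask)] = v;                // pack to the left
            else
                s_buf[31 - __popc(~B & mask)] = v;          // pack to the right
            __syncwarp();                                   // buffer ready for next phase
        }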