Abstract. In a recommendation system, customers' preferences are encoded into vectors, and finding the nearest vectors to each vector is an essential part. We define this part of problem as a k-nearest vector problem and give an effective algorithm to solve it on multiple graphics processor units (GPUs). By an experiment, we show that when the size of the problem is large, an implementation of the algorithm on two GPUs runs more than 260 times faster than a single core implementation on a latest CPU. We also show that our algorithm scales well with respect to the number of GPUs.
Introduction
A recommendation system utilizes known customers' preferences to predict unknown preferences. It is widely used in Internet-based retail shops and other service providers, such as Amazon.com [2] and Netflix [3] for example. In a recommendation system, customers' preferences or buying patterns for items are encoded into vectors and finding nearest vectors is an essential part of its computation.
Generally a recommendation system deals with large samples and large dimensions, as person×item for example. In such a case, the dimensionality reduction method such as singular valued decomposition or latent Dirichlet allocation has been widely used [4] [5] [6] . As the result of the reduction, the problem becomes the k-nearest vector search for a moderate dimension. However, the effect of the sample size n is O(n 2 ) and it is a computational burden. Therefore some approximation has been considered to be necessary [7] . This paper indicates that strict computation in practical time is possible.
The k nearest vector problem is defined as follows: when a set of vectors v 1 . . . v n ∈ R d , distance function δ and an integer k is given, find k nearest vectors to each v i . We propose an effective and scalable algorithm to solve it on multiple Graphics Processor Units (GPUs). Our algorithm is implemented in CUDA [1] , which is extension of C language provided by NVIDIA.
A graphics processor unit is a powerful commodity processor. Although a GPU is originally designed for processing of graphics, the movement of the GPGPU (General Purpose computing on GPU) has arisen as an expected breakthrough for a large scale numerical computation. The typical characteristic of the GPGPU is highly massive parallelism. To extract its power, it is necessary to run tens of thousand of threads per unit. Because of that property, it consumes large energy as a unit, but it is very energy effective per FLOPS.
As is seen by looking at the NVIDIA's show case [1] of applications of CUDA, GPGPU has been mainly applied to scientific applications, such as physics, meteorology, cosmology, molecular dynamics, etc. This paper shows GPGPU is also applicable for marketing. That indicates that the GPGPU can yield a direct value in business. Actually a GPU can be a good option for companies which need customer analysis for a lower cost. Because of its energy effectiveness and low hardware price per unit, its starting and running cost is lower than a cluster of CPUs.
However, of course, a GPU has some week points if compared to a CPU. The main difficulty lies on its less generality. Implementing an effective software on GPU is generally much more difficult than on CPU. Especially, the limited size of cache memory (or shared memory in terminology of CUDA) and the limited way of sharing data between threads often becomes a problem. A specific pattern of memory access (called coalesced memory access) to achieve high bandwidth is also problematic.
The algorithm of the k-nearest vector problem is like a mixture of N -body problem and sorting. Nyland et al. [8] showed an effective algorithm for N -body problem on CUDA. Because dealing with high dimensional vectors, we give some trick in addition to the known N -body algorithm. About sorting, [9] showed an effective algorithm, but we have employed another algorithm because we have to sort many arrays at once and just taking top k elements does not need full sort of the whole data.
The rest of this paper is organized as follows. In Sect. 2, outline of CUDA's programming model is explained. In Sect. 3, we define the problem formally. We give overview of the algorithm in Sect. 4. In Sect. 5 and 6, we explain the detail of each step of the algorithm. In Sect. 7, we show the result of experiment. We conclude in Sect. 8
Programming model of CUDA
In this section, programming model of CUDA is briefly explained. For more details of CUDA, refer to [10] .
Thread model. NVIDIA's recent graphics processor contains hundreds of stream processors (SP). The SP is like a core in a CPU; they can compute simultaneously. For example, GTX280 has 240 SPs. With such many SPs and very low cost of context switch, a GPU performs well for tens of thousands of threads. Threads are divided into thread blocks. Each thread block can contain at most 1024 threads. A function to synchronize threads in a block is provided, while there is no such function to synchronize thread blocks. The only way to synchronize thread blocks is to bring back the control to the CPU.
Hierarchal memories. Before running a GPU, the CPU must explicitly copy data to the GPU's memory. The memory in GPU to share the data with CPU is called global memory. A thread block is also a unit to share data. Each thread block has a memory to share only in the thread block. It is called shared memory. The access to the global memory is relatively slow, and usually copying necessary data to shared memory is better for performance. Although global memory has some gigabytes, shared memory has only 16KB for each thread block. Each thread also has a local memory which is called a register. The access to a register is fast, but its size is also limited.
Coalesced memory access. In CUDA, for example, if successive 16 threads are accessing the successive 128 bytes in global memory at the same time, the memory access is coalesced. When a memory access is coalesced, it is done in only one fetch while otherwise access by 16 threads takes 16 fetches. Hence, effective utilization of coalesced memory access affects very much the total performance of an application. The detailed condition about when memory access can be coalesced is explained in [10] .
Description of the problem
The k-nearest vector problem is described as follows.
Suppose that a set of vectors
Then output the k nearest vectors to each v i . In other words, for each i, find a subset of indices {j i1 , . . . , j ik } ⊂ {1, 2, . . . , n} such that
The distance function δ is arbitrary. Although we use the word "distance", it does not necessarily need to satisfy the axiom of distance. We assume that δ is cumulatively computable. It means δ can be computed step by step by referring to each coordinate values. In other words, it is computed with a function δ : R × R × R → R and some initial value a 1 by a i =δ(u i−1 , v i−1 , a i−1 ) and δ(u, v) = a n+1 . This assumption is not so artificial. Actually famous distances satisfy this condition; Euclidean distance, Hellinger distance, and KullbackLeibler divergence satisfy it for example.
In this paper, we only discuss the case when δ is symmetric: i.e. δ(u, v) = δ(v, u). In the symmetric case, we can omit the half of distance calculations, and balancing the workload becomes more difficult. The algorithm explained in this paper is easily modified for non-symmetric distance function.
We assume that δ is symmetric, and we only compute δ(v x , v y ) for x > y. For an explanation, we depict the whole problem as a square where the point (x, y) stands for the computation of δ(v x , v y ). The distances to compute is represented by upper right triangle of the square.
Because of the limitation of the number of threads which can be run at once on GPU, the problem is divided into a first level blocks. We call each of them a grid (Fig. 1) . Each grid is processed in a GPU at once. A grid can be divided row-wisely into blocks, each of which is computed in a thread block. We denote the size of each side of a grid by GSIZE. It means the grid (X, Y ) stands for the region GSIZE · X ≤ x < GSIZE · (X + 1), GSIZE · Y ≤ y < GSIZE · (Y + 1). Similarly we denote the size of a block (i.e. the number of rows in a block) by BSIZE (Fig. 2) . GSIZE is determined depending on n so that the problem can be devided effectively, while BSIZE is fixed according to the capability of CUDA.
Fig. 1. First level division of the problem

Fig. 2. Devision of a grid
To balance the workload, we assign GPUs as in Fig. 3 . In other words, the i-th row of grids is assigned to j-th GPU when i mod 2 · nDevices = j or i mod 2 · nDevices = 2 · nDevices − j − 1, where nDevices means the number of GPUs available. Here note that although it is enough to compute the upper-right part of the problem, each GPU virtually compute the mirror side of the assigned part (see also Fig. 4) To keep the k-nearest vectors, we use a heap structure. The heap has at most k elements and is in descending order, so that the k-th smallest element can be found in O(1). Moreover, each GPU keeps their own heaps to avoid a costly synchronization (Fig. 4) . It means each GPU has n heaps which stores the k nearest elements computed by itself. At the last phase, the heaps of different GPUs are merged in CPU.
Thus the outline of the algorithm is shown in Fig. 5 . Line 7 of this algorithm is explained in Sect. 5, and line 8 in Sect. 6 
4:
for Y := 0 to nGrids − 1 do 5:
for X := 0 to nGrids-1 do 6:
if i mod 2 · nDevices = j or i mod 2 · nDevices = 2 · nDevices − j − 1 then 7:
Calculate the distances for the grid (X, Y ) 8:
Push the i-th row of distances to hi for the grids (X, Y ) and (Y, X) 9:
end if 10:
end for 11:
end for 12: end procedure
Phase 1: calculation of distances
Basically the framework of the process to compute the distances of vectors is the same as the algorithm of N -body problem written in [8] . A grid is row-wisely devided into blocks, and each block is assigned to each thread block. Each thread corresponds to each row. A block first copies a fixed number (which we denote by C1) of columns to the shared memory. Then compute the distances.
However, in our problem, since the dimension d is large (the target size is d ≈ 10
3 ), it is not possible to copy all the coordinate data to the shared memory even for a small C1. Hence, a thread iteratively reads a fixed number C2 of coordinate values of corresponding vectors. In other words, if v i is expressed as (v
are read in j-th iteration (Fig. 6) .
If a vector is expressed by single precision numbers, C2 must be a multiple of 32 to utilize full power of coalesced memory accesses. The algorithm to calculate the distaces for a given grid is shown in Fig. 7 . Here, the arguments n 1 , {v 1i } n1−1 n=0 , n 2 , and {v 2i }
n2−1 i=0
are given as re-indexed {v i } so that this procedure can calculate for the assigned grid. The index for the block is expressed by bid, and each block has BSIZE × C1 threads. Each thread is indexed by (tx, ty) 6 Phase 2: taking k smallest elements
In the second phase, each thread block is assigned to each row. The smallest k distances are computed by parallel processing of threads in the block. If the number of thread in a block is denoted by nThreads, each thread read distances skipping nThreads, so that memory access is coalesced. A thread check if the element is smaller than current k-th largest element in the heap, and store it in the local buffer if so. This is because k is relatively small than n and it is likely that only a few elements is stored in the local buffer. Because of this mechanism, the waiting time is shortened even though when pushing to the heap, the threads must be synchronized.
The algorithm is shown in Fig. 8 . Here, the index for the block and thread is denoted by bid and tid respectively, and buffer is thread-local array and its size is bufsize.
Experiment
We experimented our algorithm on two GTX280's and one GTX280. For a comparison, we also implemented CPU version and experimented it on Intel i7 920 Fig. 7 . Algorithm for calculation of distances: for simplicity, it is assumed that n1 is multiple of C1 and d is multiple of C2 1: procedure CalcDistances(d,n1,{v1i}
bx ← 0 3:
Prepare the shared memory to store the distances 4:
while bx · C1 < n1 do 5:
l ← 0 6:
Calculate cumulatively all the combinations of v1i and v2i which are in the shared memory and store it in a local resister dist 9:
l ← l + C2 10:
end while 11:
Store the resulting distance dist in the global memory 12:
bx ← bx + C1 13:
end while 14: end procedure (2.67GHz). GTX280 is one of the latest NVIDIA's graphics chips. The algorithm experimented on the CPU is a simple one: it calculates each δ(v x , v y ) (x > y) and pushes it to the corresponding heaps (Fig 9) . Note that although Intel i7 has four cores with hyperthreading capability, we only worked on serial algorithm, i.e. it only uses one core.
The result of the experiment for various n is shown in Table 1 . The other parameters are set as k = 100 and d = 256; and the data is generated randomly. It shows that for a large problem, our algorithm work well from the viewpoint of parallelism of GPUs. Moreover, it also tells the GPUs substantially outperforms the CPU; for a large problem, two GPU implementation is more than 260 times faster than the CPU. for i := tid to n − 1 step nThreads · bufsize do 3:
for j := 0 to nThreads · bufsize step bufsize do 4:
for l := 0 to bufsize do 5:
if a bid,ν is smaller than top of the heap h bid then 7:
Store a bid,ν to buffer 8: end if 9:
end for 10:
Push elements of buffer to h bid (blocking other threads) 11:
end for 12:
end for 13: end procedure Fig. 9 . Serial algorithm used for a comparison 1: Prepar n heaps heap 0 , . . . , heap n−1 2: for x := 0 to n − 1 do 3:
for y := 0 to n − 1 do 4:
Compute δ(vx, vy) 5:
Push the result to heap x and heap y 6: end for 7: end forSort the data in heap 0 , . . . , heap n−1 and return them.
Conclusion
We introduced an effective algorithm for k-nearest vector problem which works on multiple GPUs. By an experiment, we have shown that it runs more than 260 times faster than an implementation on a single core of an up-to-date CPU. We have also shown that the algorithm is effective from the viewpoint of parallelism of GPUs. That is because 1) there is no synchronization between GPUs until the very end of the process and 2) the workload is well balanced.
Future work. We think there is a further room of consideration for the algorithm of "taking k smallest elements." Since this problem is general and can be applied to other problems, further optimization of it is valuable. Since we originally started from a recommendation system, we are also thinking of GPU implementation of a singular value decomposition of a matrix, which gives us total speed-up of the system.
