Abstract. In this paper, we present a simulation of an abstract SIMD-type model with vertical data processing (the STAR-machine) on GPUs (Graphics Processing Units) using the CUDA framework. A number of algorithms have been developed for the STAR-machine, and recent research shows that the model is highly efficient for solving graph problems. Associative operations are the key feature of this model; in particular, each of them takes constant time. We present an implementation of these associative operations on GPU. This study aims to provide a bridge, or a general manual, for converting STAR algorithms into GPU implementations. Since the STAR-machine architecture has not yet been built with modern technology, this provides a way to implement STAR algorithms on an alternative platform and to verify their correctness and efficiency, especially for massive data input.
Introduction
Associative (content-addressable) parallel processors of the SIMD type with vertical processing and simple single-bit processing elements are ideally suited to fast parallel search operations used in applications such as graph theory, relational database processing, and image processing. Such an architecture performs data parallelism at the basic level, provides a massively parallel search by contents, and allows two-dimensional tables to be used as basic data structures [1]. This class of parallel computers includes the well-known systems STARAN, DAP, MPP, and CM-2. The STAR-machine is an abstract model of the SIMD type. A number of algorithms [2] [3] [4] [5] [6] [7] [8] [9] [10] have been developed for this model.
While a great number of STAR algorithms are being developed, their implementation becomes an interesting problem. Because the STARAN platform is not accessible in today's computer labs, it is impossible to truly evaluate the performance of a STAR algorithm on massive data input. The VisualStar system has been developed to simulate the STAR-machine [11]. It allows one to edit, compile, and execute procedures written in the STAR language. VisualStar is useful for developing and testing algorithms, but not for processing large bodies of data, because its computation is sequential.
We look for a platform that can implement the STAR-machine so as to run its algorithms in a massively parallel mode with high efficiency. An ideal platform must have a SIMD architecture and be easily accessible. Nvidia GPUs meet both requirements.
GPUs were originally designed to accelerate intensive graphics data processing. Later, with the introduction of Nvidia CUDA (Compute Unified Device Architecture), a high-level programming interface, the GPU evolved into a powerful platform for general-purpose parallel computation. It has been used in numerous application fields for massively parallel data processing [12]. The GPU is a typical SIMD architecture and is especially well suited to fine-grained, data-intensive parallel computation on large amounts of data. These features make it possible to implement associative parallel algorithms with easy accessibility and high scalability.
To implement a STAR algorithm on a given architecture, we need a way to execute each basic operation of the language STAR in the corresponding running environment. Providing such an implementation for the GPU is the contribution of this paper.
Similar work has been done for another associative computing model, MASC [13].
This paper is organized as follows. The model of the STAR-machine is described in Section 2. Section 3 presents the implementation steps for each basic operation of the STAR-machine on GPU. Section 4 gives the performance analysis.
The STAR-machine model
The STAR-machine is a model of the SIMD type with vertical data processing. It consists of the following components:
• a sequential control unit (CU), where programs and scalar constants are stored;
• an associative processing unit consisting of p single-bit processing elements (the PEs);
• a matrix memory for the associative processing unit.
The CU passes an instruction to all PEs in one unit of time. All active PEs execute it in parallel, while inactive PEs do not. Activation of a PE depends on the data. Note that the time needed to perform any instruction does not depend on the number of processing elements [14]. Input binary data are loaded into the matrix memory in the form of two-dimensional tables, in which each datum occupies an individual row and is updated by a dedicated processing element. It is assumed that there are more PEs than data items. The rows are numbered from top to bottom and the columns from left to right. Both a row and a column can be easily accessed. Several tables may be loaded into the matrix memory.
An associative processing unit is represented as h vertical registers (h ≥ 4), each consisting of p bits. A vertical register can be regarded as a one-column array. The bit columns of the tabular data are stored in the registers, which perform the necessary Boolean operations and record the search results. The operation of the STAR-machine is described in the language STAR [15], which is an extension of Pascal. Let us briefly consider the basic constructs of STAR. To simulate data processing in the matrix memory, we use the data types word, slice, and table. Constants of the types slice and word are represented as sequences of symbols from {0, 1} enclosed in single quotation marks. The types slice and word are used for bit column access and bit row access, respectively, and the type table is used for defining tabular data. Assume that any variable of the type slice consists of p components that belong to {0, 1}. For simplicity, we call any variable of the type slice a slice.
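Now, we present some elementary operations and predicates for slices. Let X, Y be variables of the type slice and i be a variable of the type integer. We use the following operations:
• SET(Y) sets all components of Y to '1';
• CLR(Y) sets all components of Y to '0';
• Y(i) selects the ith component of Y;
• FND(Y) returns the ordinal number i of the first (uppermost) component '1' of Y, i ≥ 0;
• STEP(Y) returns the same result as FND(Y) and then resets the first found '1' to '0'.
In the usual way, we introduce the predicates ZERO(Y) and SOME(Y) and the bitwise Boolean operations X and Y, X or Y, not Y, and X xor Y.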
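The slice operations named in this paper can be illustrated with a short sequential sketch in Python. It is only a model of the semantics, not GPU code: a slice is represented as a list of bits, positions are 1-based, and FND is assumed to return 0 when the slice contains no '1' (consistent with the test i>0 used in the procedure WARSHALL-1 later in the paper). The function names AND, OR, NOT, and XOR are chosen here for illustration.

```python
# Sequential model of STAR slice operations (illustrative sketch only).
# A slice is a Python list of 0/1 values; positions are 1-based.

def SET(y):
    """Set all components of the slice y to 1 (in place)."""
    y[:] = [1] * len(y)

def CLR(y):
    """Set all components of the slice y to 0 (in place)."""
    y[:] = [0] * len(y)

def FND(y):
    """Return the 1-based position of the first 1 in y, or 0 if there is none
    (the 0 result for an empty slice is an assumption of this sketch)."""
    for pos, bit in enumerate(y, start=1):
        if bit:
            return pos
    return 0

def STEP(y):
    """Return FND(y) and reset the found component to 0."""
    i = FND(y)
    if i:
        y[i - 1] = 0
    return i

def SOME(y):
    """True if y contains at least one 1."""
    return any(y)

def ZERO(y):
    """True if y contains no 1."""
    return not any(y)

# Bitwise Boolean operations; each returns a new slice.
def AND(x, y):
    return [a & b for a, b in zip(x, y)]

def OR(x, y):
    return [a | b for a, b in zip(x, y)]

def XOR(x, y):
    return [a ^ b for a, b in zip(x, y)]

def NOT(y):
    return [1 - b for b in y]
```

Repeated calls of STEP on a slice enumerate the positions of its '1' components from top to bottom, which is the pattern used by the associative algorithms below.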
Let T be a variable of the type table. We employ the following two operations: ROW(i, T) returns the ith row of the matrix T; COL(i, T) returns the ith column of T.
Remark 1. All operations for the type slice can also be performed for the type word.
Remark 2. Note that the STAR statements [2] are defined in the same manner as for Pascal.
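To make the table operations concrete, here is a minimal Python sketch of ROW and COL. It assumes a column-by-column storage of the table, which mirrors vertical processing; the class name Table matches the paper, but the method names and the list-based layout are illustrative assumptions.

```python
# Sequential model of the STAR table operations (illustrative sketch only).

class Table:
    """A bit table stored column by column: cols[j][i] is row i+1, column j+1."""

    def __init__(self, rows):
        # rows: list of equal-length lists of 0/1 values
        self.cols = [list(c) for c in zip(*rows)]

    def COL(self, i):
        """Return the i-th column (1-based) as a slice; no data are copied."""
        return self.cols[i - 1]

    def ROW(self, i):
        """Return the i-th row (1-based) as a word, gathering one bit per column."""
        return [c[i - 1] for c in self.cols]

    def set_row(self, i, word):
        """Write the word into the i-th row, one bit per column."""
        for c, b in zip(self.cols, word):
            c[i - 1] = b
```

With this layout a column is available directly, while a row has to be gathered from all columns, which is why the two operations are treated differently in the GPU implementation.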
Associative operations on GPU
In this section, we present the implementation steps for the basic operations of the STAR-machine on the GPU platform. First, the realization of the data types is described. Then the implementation of each associative operation on GPU is discussed in turn.
3.1. The types of data. As noted above, to simulate data processing in the matrix memory, we use the data types word, slice, and table in the STAR language.
The GPU stores data in conventional random-access memory, where a data item is identified by its memory address. CUDA C is an extension of the C language. Thus, to simulate the data types, we declare classes of the same names. The class Slice is also used to simulate the type word. In the class Table, the device pointer *d_table stores the address of the first element of an array of M × N elements of type unsigned long long in the device global memory. The host pointer *slice_device_pointer_table and the device pointer *d_slice_device_pointer_table store the address of the first element of an array of N pointers to the columns of d_table. The operations SET(X), CLR(X), and the other bitwise Boolean operations are implemented in the same way. Performing these operations takes longer by a constant factor, which does not depend on the slice length.
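The idea of storing a slice in 64-bit elements can be sketched in Python as follows. This is a sequential model rather than the CUDA code: the names pack, get_bit, and or_words are hypothetical, and the bit order (bit k of element w holds row 64w + k, counted from 0) is an assumption of the sketch.

```python
# Packing a slice into 64-bit elements (illustrative sketch only).
WORD = 64  # bits per element, matching unsigned long long

def pack(bits):
    """Pack a list of 0/1 values into a list of 64-bit integers.
    Assumed bit order: bit k of element w holds row 64*w + k (0-based)."""
    words = [0] * ((len(bits) + WORD - 1) // WORD)
    for r, b in enumerate(bits):
        if b:
            words[r // WORD] |= 1 << (r % WORD)
    return words

def get_bit(words, r):
    """Read the bit of 0-based row r from a packed slice."""
    return (words[r // WORD] >> (r % WORD)) & 1

def or_words(x, y):
    """Bitwise OR of two packed slices: one machine operation covers 64 rows.
    On a GPU, each element could be handled by its own thread."""
    return [a | b for a, b in zip(x, y)]
```

Because every element is processed independently, an operation such as or_words maps naturally onto one thread per element, which is why its running time does not depend on the slice length as long as enough threads are available.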
Now, let us discuss the implementation of the operation FND(X). It is carried out in two stages:
• At the first stage, the array d_v[N] is reduced to a single variable of the type unsigned long long by calling the global procedure find. At each level, the array X is reduced to an array new_x whose size is 64 times smaller: each non-zero element of X maps to a bit 1 of new_x, and each zero element of X maps to a bit 0.
• At the second stage, the global procedure first_backward computes the result by expanding back up through the levels.
Performing the operation FND(X) takes longer by a factor of O(log N) for a slice of length N. The operation STEP(X) and the predicates SOME(X) and ZERO(X) are performed in the same way.
Table. The operation COL(i, T) is performed by calling T.col(i).
The operators on the class
It is used to access columns for both reading and writing.
ROW(i, T) is implemented with two procedures. The procedure SetRow(Slice *s, int i) writes the word s into the ith row of the table, and the function Slice *Table::row(int i) returns a pointer to the word that corresponds to the ith row of the table. The same algorithm is used in both cases:
• In each column, there is an element that contains one bit of the row. Thus, the following indices are computed: the element number and the bit number. The indices are the same for all columns.
• Bits are read or written in parallel by columns. This yields M variables tmp (M being the size of the table) of the type unsigned long long int; in the kth variable, the kth bit is equal to the ith bit of the kth column, and the other bits are equal to zero.
• To read the row, all bits are gathered into the word.
Performing the operations COL(i, T) and ROW(i, T) takes longer by a constant factor, but COL(i, T) is faster than ROW(i, T).
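The row access steps listed above can be sketched sequentially in Python. The sketch assumes that each column is a list of 64-bit elements, that rows are indexed from 0, and that bit k of the resulting word corresponds to column k; the names get_row and set_row follow the procedures mentioned in Section 4, but the bodies are illustrative and not the CUDA code. For simplicity the word is held in a single Python integer of arbitrary length.

```python
# Reading and writing a table row over packed columns (illustrative sketch only).
WORD = 64  # bits per element, matching unsigned long long

def row_indices(i):
    """Map a 0-based row number to (element number, bit number).
    The pair is the same for every column."""
    return i // WORD, i % WORD

def get_row(columns, i):
    """Gather row i into one word; bit k of the result is the bit of column k."""
    elem, bit = row_indices(i)
    # On a GPU, each tmp value would be produced by a separate thread.
    tmp = [((col[elem] >> bit) & 1) << k for k, col in enumerate(columns)]
    word = 0
    for t in tmp:          # gathering step: OR all partial values together
        word |= t
    return word

def set_row(columns, i, word):
    """Write bit k of word into row i of column k, for every column."""
    elem, bit = row_indices(i)
    for k, col in enumerate(columns):
        b = (word >> k) & 1
        col[elem] = (col[elem] & ~(1 << bit)) | (b << bit)
```

The sketch shows why a row access costs more than a column access: every column contributes one bit, and reading additionally needs the gathering step.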
The performance analysis
Now let us consider the simulation of the Warshall associative parallel algorithm. This algorithm was described and proven in [2] . The procedure receives as input data an adjacency matrix of the graph with n vertices and returns the path matrix for the transitive closure of the graph.
proc WARSHALL(n: integer; var P: table);
/* Here n is the number of graph vertices. */
var X: slice(P);
    v,w: word;
    i,k: integer;
begin
  for k:=1 to n do
  begin
    X := COL(k,P);
    w := ROW(k,P);
    while SOME(X) do
    begin
      i := STEP(X);
      v := ROW(i,P);
      v := v or w;
      ROW(i,P) := v;
    end;
  end;
end;
It uses the following basic operations: or, SOME, COL (to read), ROW (to read and to write), and STEP. It is sufficient to consider these operations, because the other operations are implemented and run in the same way.
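For readers who want to trace the procedure, the following Python sketch mirrors WARSHALL step by step on a small adjacency matrix. It is a sequential model: the matrix is a plain list of rows, indices are 0-based, and the function name warshall is chosen here for illustration.

```python
# Sequential trace of the associative Warshall procedure (illustrative sketch only).

def warshall(P):
    """P is an n x n list of 0/1 rows (adjacency matrix).
    It is turned in place into the path matrix and returned."""
    n = len(P)
    for k in range(n):
        X = [P[r][k] for r in range(n)]   # X := COL(k, P)
        w = list(P[k])                    # w := ROW(k, P)
        while any(X):                     # while SOME(X)
            i = X.index(1)                # i := STEP(X): first 1 ...
            X[i] = 0                      # ... which is then reset
            v = P[i]                      # v := ROW(i, P)
            v = [a | b for a, b in zip(v, w)]   # v := v or w
            P[i] = v                      # ROW(i, P) := v
    return P
```

For the chain 0 → 1 → 2, the call warshall([[0, 1, 0], [0, 0, 1], [0, 0, 0]]) adds the path from vertex 0 to vertex 2 and leaves the other rows unchanged.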
4.1. The basic operations runtime. To estimate the running time, the Warshall algorithm is performed on two example graphs, with 100 and 1000 vertices. The result of profiling is shown in the table.
The basic operation ROW(i, T) is simulated by the procedures get_row and set_row. The running time of the first is nearly the same for both graphs and does not depend on the size of a row. The running time of the second roughly doubles when the size of a row increases by a factor of 10.
The procedures find and first_backward perform the operation FND(X) (the procedure put is called after them to implement STEP(X)). Their running times are similar for both graphs because the array is reduced only once when the number of vertices is smaller than 64 × 64 = 4096. The number of reduction operations and the number of vertices are related as follows: n reductions are performed when the number of vertices lies in [64^n, 64^(n+1)]. The procedure put is used to implement the basic operations X(i) and STEP(X). It runs in constant time.
The procedure or_long_value performs the basic operation X or Y and runs in parallel in constant time. The other bitwise Boolean operations behave similarly.
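The two-stage computation of FND(X) and its dependence on the number of reduction levels can be sketched in Python as follows. This is a sequential model under stated assumptions: the slice is a list of 64-bit elements, the lowest bit of the first element is taken to be the first row, the result is a 1-based position with 0 meaning that no '1' was found, and the names reduce_level, lowest_bit, and fnd are hypothetical.

```python
# Two-stage FND over a packed slice (illustrative sketch only).
WORD = 64  # bits per element, matching unsigned long long

def reduce_level(x):
    """Stage 1, one level: each non-zero element of x becomes a bit 1 and each
    zero element a bit 0, so the result is 64 times shorter than x."""
    new_x = [0] * ((len(x) + WORD - 1) // WORD)
    for j, e in enumerate(x):
        if e:
            new_x[j // WORD] |= 1 << (j % WORD)
    return new_x

def lowest_bit(v):
    """0-based index of the lowest set bit of a non-zero integer v."""
    return (v & -v).bit_length() - 1

def fnd(x):
    """Return the 1-based position of the first 1 in the packed slice x,
    or 0 if the slice is empty or contains no 1."""
    if not x:
        return 0
    levels = [x]
    while len(levels[-1]) > 1:            # stage 1: reduce to a single element
        levels.append(reduce_level(levels[-1]))
    if levels[-1][0] == 0:
        return 0
    idx = 0
    for level in reversed(levels):        # stage 2: walk back through the levels
        idx = idx * WORD + lowest_bit(level[idx])
    return idx + 1
```

A slice of up to 64 rows fits in one element and needs no reduction, and a slice of up to 4096 rows needs one; each further factor of 64 adds one level, which gives the logarithmic factor in the running time.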
The basic operation COL(i, T ) is performed by passing a pointer to the column.
4.2. The data size. Now we consider the issues associated with the data size.
If a graph with n vertices is represented as an adjacency matrix, then we need 8(n(⌈n/64⌉ + 2) + 1) bytes to store it. The relation between the size of the global memory and the number of graph vertices is shown in the figure. Thus, we need ≈ 3.78 Gbyte of global memory for a graph with 180K vertices.
proc WARSHALL-1(n: integer; var P: table);
/* Here n is the number of graph vertices.
   P is the transposed adjacency matrix. */
var X: word;
    v,w: slice;
    i,k: integer;
begin
  for k:=1 to n do
  begin
    X := ROW(k,P);
    w := COL(k,P);
    i := STEP(X);
    while i>0 do
    begin
      v := COL(i,P);
      v := v or w;
      COL(i,P) := v;
      i := STEP(X);
    end;
  end;
end;
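The memory estimate given above can be checked with a few lines of Python. The sketch assumes that n/64 in the formula is rounded up to a whole number of 64-bit elements and that 1 Gbyte is 2^30 bytes; under these assumptions it reproduces the quoted value for 180K vertices. The function name table_bytes is hypothetical.

```python
# Evaluating the memory estimate 8(n(ceil(n/64) + 2) + 1) (illustrative sketch only).

def table_bytes(n):
    """Bytes needed for the adjacency matrix of a graph with n vertices,
    assuming n/64 is rounded up to whole 64-bit elements."""
    words_per_column = -(-n // 64)        # integer ceiling of n / 64
    return 8 * (n * (words_per_column + 2) + 1)

def to_gbyte(num_bytes):
    """Convert bytes to Gbyte, taking 1 Gbyte = 2**30 bytes (assumption)."""
    return num_bytes / 2**30
```

For n = 180000 the estimate is 4053600008 bytes, which is about 3.78 Gbyte.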
Conclusion
In this paper, we have proposed an implementation of the associative operations of the STAR-machine on the GPU architecture. As shown in Section 3, most of these associative operations can be implemented on GPU with an extra O(log N) efficiency loss in theory. However, as shown in Section 4, in practice the associative operations run with an efficiency close to constant. Thus, the implementation steps presented here can be used directly to convert a STAR algorithm for the GPU-CUDA platform.
In the future, we are planning to implement the library of basic and auxiliary procedures from [2, 3]. In addition, some blocks of STAR operators could perhaps be performed in a way that takes into account the differences between the STAR-machine and GPU.
Our purpose is to extend the implementation into a technology that would allow one to use and to develop associative parallel algorithms.
