Introduction
It is well known that associative (content-addressable) parallel processors of the SIMD type are particularly well suited for performing fast parallel search operations in many search-intensive algorithms. Recent advances in associative processing include new applications in computational geometry, relational databases, and artificial intelligence [6], a new selection of extreme search algorithms for fully parallel memories resulting from average-case analysis [15], multiassociative/hybrid search [4], and an experimental implementation of multi-comparand multi-search associative processors with corresponding parallel algorithms for search problems from various complexity classes [7, 8].
In fully parallel associative machines, every cell has a built-in single-bit comparator, and therefore some searches (e.g. exact match, mismatch) can be performed faster than in bit-serial (vertical) machines, where the word size is crucial. However, searches that require sequential (bit-serial) processing of words are asymptotically equivalent in both models. A significant speedup over sequential processors can be achieved in multi-comparand fully parallel processors, where the set of comparands is matched simultaneously against the data set [1].
Graph algorithms are a specific class of algorithms that deserve the particular interest of researchers and programmers [16]. Many graph algorithms have been implemented on associative parallel processors. Such algorithms employ different graph representation forms given as two-dimensional tables.
In [2], a special case of Dijkstra's algorithm for finding the shortest path between two vertices in unweighted undirected graphs and Floyd's shortest path algorithm have been implemented on the bit-serial word-parallel associative processor LUCAS.
In [5], simple algorithms for a group of graph problems (independent set, clique, chromatic number, graph isomorphism) have been proposed on the single-comparand fully parallel associative model RAMRC. This model is supported by dedicated generators of combinatorial objects for mask and pattern generation.
In [10, 11, 12], a natural straightforward implementation of a group of classical graph algorithms using simple and natural data structures has been proposed on the model of associative parallel systems with vertical processing (the STAR-machine). This group includes associative versions of the following algorithms: Warshall's algorithm for finding the transitive closure, Floyd's shortest path algorithm, and Dijkstra's and the Bellman-Ford algorithms for finding the single-source shortest paths.
However, all the above-mentioned models of associative computation impose some bounds on the set of processor operations. Efficient implementation of associative algorithms for a new class of graph problems requires new operations and extensions to existing processor models.
In this paper, we introduce a new associative machine model called the Associative Graph Processor (AGP), suitable for implementing associative algorithms that require, in addition to search capabilities, a set of graph-related operations. The model can perform bit-serial and fully parallel associative processing of matrices representing graphs that are stored in memory, as well as some basic set operations on matrices (sets of columns) such as copy, addition, minimum, maximum, and merging. We describe these novel features in detail and discuss a possible hardware implementation of the model.
Model of Associative Graph Processor
In this section, we propose a model of the SIMD type with simple single-bit PEs called the Associative Graph Processor (AGP). It carries out both bit-serial and bit-parallel processing. To simulate access to data by content, we use both the typical operations for associative systems first presented in STARAN [3] and some new operations to perform bit-parallel processing.
The model consists of the following components:
- a sequential common control unit (CU), where programs and scalar constants are stored;
- an associative processing unit forming a two-dimensional array of single-bit PEs;
- a matrix memory for the associative processing unit.
The CU broadcasts each instruction to all PEs in unit time. All active PEs execute it simultaneously while inactive PEs do not perform it. Activation of a PE depends on the data employed.
Input binary data are loaded into the matrix memory in the form of two-dimensional tables, where each data item occupies an individual row and is updated by a dedicated row of PEs. In the matrix memory, the rows are numbered from top to bottom and the columns from left to right. Both a row and a column can be easily accessed.
By analogy with other models of associative processing, we assume that input data are loaded in the matrix memory before updating them.
The associative processing unit is represented as a matrix of single-bit PEs that correspond to the matrix of input binary data. Each column in the matrix of PEs can be regarded as a vertical register that maintains the entire column of a table. Our model runs as follows. Bit columns of tabular data are stored in the registers which perform the necessary bitwise operations.
To simulate data processing in the matrix memory, we use the data types slice and word for bit column access and bit row access, respectively, and the type table for defining and updating matrices. We assume that any variable of the type slice consists of n components. For simplicity, let us call "slice" any variable of the type slice.
For variables of the type slice, we employ the same operations as in the case of the STAR-machine along with a new operation FRST(Y).
The operation FRST(Y) saves the first (the uppermost) bit '1' of the slice Y and sets its other bits to zero. In the usual way, we introduce the predicates ZERO(Y) and SOME(Y) and the bitwise Boolean operations on slices.
The above-mentioned operations are also used for variables of the type word.
Let T be a variable of the type table. We use the following two operations:
ROW(i, T) returns the i-th row of the matrix T; COL(i, T) returns the i-th column of T.
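As an illustration only, the data types and access operations described above can be modelled in a few lines of Python. The sketch assumes that a table is stored as a list of rows, that rows and columns are numbered from 1, and that FRST keeps the uppermost '1' bit of a slice and clears all others. The function names mirror the operations in the text; the Python representation itself is hypothetical and says nothing about the hardware.

```python
from typing import List

Slice = List[int]        # a bit column: one bit per row of a table
Word = List[int]         # a bit row: one bit per column of a table
Table = List[List[int]]  # a matrix stored as a list of rows (assumption)


def ROW(i: int, T: Table) -> Word:
    """Return a copy of the i-th row of T (1-based numbering is assumed)."""
    return list(T[i - 1])


def COL(i: int, T: Table) -> Slice:
    """Return the i-th column of T as a slice (1-based numbering is assumed)."""
    return [row[i - 1] for row in T]


def SOME(Y: Slice) -> bool:
    """True if the slice contains at least one '1' bit."""
    return any(Y)


def ZERO(Y: Slice) -> bool:
    """True if the slice consists of zeros only."""
    return not any(Y)


def FRST(Y: Slice) -> Slice:
    """Keep only the uppermost '1' bit of Y (assumed semantics of FRST)."""
    result = [0] * len(Y)
    for position, bit in enumerate(Y):
        if bit:
            result[position] = 1
            break
    return result


if __name__ == "__main__":
    T = [[1, 0, 1],
         [0, 1, 1]]
    print(ROW(2, T), COL(3, T), FRST([0, 1, 1, 0]))
```

In this model a slice and a word are both plain bit lists, so the same Boolean helpers apply to either of them, in line with the remark that the slice operations are also used for words.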
Moreover, we use two groups of new operations. One group of such operations is applied to a single matrix, while the other one is used for two matrices of the same size. 
A Group of Basic Procedures
In this section, we propose a bit-parallel implementation of a group of basic procedures on the AGP model. In particular, we employ these procedures to represent different algorithms on associative parallel systems.
The procedure WCOPY(w, X, F) writes the binary word w in the rows of the result matrix F that correspond to ones in the given slice X. Other rows of F will consist of zeros. On the AGP model, the procedure runs as follows. The slice X is simultaneously written in the columns of F that correspond to ones in the given w, while an all-zero slice is simultaneously written in the other columns of F.
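The column-wise idea behind WCOPY can be sketched in Python as follows. This is a sequential model under stated assumptions, not the AGP implementation: the names X (the slice) and F (the result matrix) and the helper name `wcopy` are chosen for illustration, and the list comprehension stands in for what the processor would do for all columns at once.

```python
from typing import List


def wcopy(w: List[int], X: List[int]) -> List[List[int]]:
    """Build the result matrix F: rows marked by ones in X receive w, others are zero.

    The matrix is assembled column by column: column j is a copy of X
    when w[j] == 1 and an all-zero slice otherwise.
    """
    n, k = len(X), len(w)
    zero_slice = [0] * n
    columns = [list(X) if w[j] else list(zero_slice) for j in range(k)]
    # Transpose the list of columns into a list of rows.
    return [[columns[j][i] for j in range(k)] for i in range(n)]


if __name__ == "__main__":
    for row in wcopy([1, 0, 1], [1, 0, 1, 1]):
        print(row)
```

Writing whole columns rather than whole rows is what makes the operation independent of the number of rows in this model: the work is proportional to the number of columns of w.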
The procedure WMERGE(w, X, F) writes the given string w in the rows of the given matrix F that correspond to ones in the given slice X. Other rows of F do not change. The procedure WMERGE runs as follows. We first save the given string w in the matrix rows marked by '1' in the slice X. Then, in another matrix, we save the rows of F that correspond to '1' in the negation of X. Finally, after performing the operation or(..., v), we obtain the result.
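The three steps of WMERGE can be mimicked in Python as a mask-and-combine computation. The sketch below is only a sequential approximation under assumptions: the names X, F and `wmerge` are illustrative, and the two intermediate matrices are ordinary lists rather than AGP tables.

```python
from typing import List


def wmerge(w: List[int], X: List[int], F: List[List[int]]) -> List[List[int]]:
    """Return F with the rows marked by ones in X replaced by the word w."""
    # Step 1: the word w in every row marked by '1' in X, zeros elsewhere.
    new_rows = [[bit & x for bit in w] for x in X]
    # Step 2: the rows of F marked by '1' in the negation of X, zeros elsewhere.
    kept_rows = [[bit & (1 - x) for bit in row] for row, x in zip(F, X)]
    # Step 3: bitwise disjunction of the two intermediate matrices.
    return [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(new_rows, kept_rows)]


if __name__ == "__main__":
    F = [[0, 0, 0],
         [1, 1, 1],
         [0, 1, 0]]
    for row in wmerge([1, 0, 1], [1, 0, 1], F):
        print(row)
```

Because the two intermediate matrices are non-zero in disjoint sets of rows, the final disjunction never mixes bits of w with bits of the original rows.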
The procedure TMERGE(T, X, F) writes the rows of the matrix T that are marked by ones in the given slice X into the corresponding rows of the result matrix F. The rows of F that correspond to zeros in the slice X do not change.
procedure TMERGE(T: table; X: slice; var F: table)

Let us explain the main idea of this procedure. In one matrix, we simultaneously save the rows of the given matrix T that are marked by ones in the slice X. Then, in another matrix, we simultaneously save the rows of the matrix F that correspond to zeros in the slice X. These rows of F will not change. The result matrix F is obtained after performing the statement or(..., v).

The procedure HIT defines the positions of the corresponding identical rows in the given matrices T and F using the slice X. It returns a slice whose i-th bit is '1' if and only if the i-th rows of T and F coincide and the i-th bit of X is '1'. This procedure runs as follows. First, we simultaneously compare the corresponding columns of the given matrices T and F and save the result in a matrix. Then we simultaneously compute the disjunction of the bits in every row of this matrix and save the result in a slice. To define the result slice, we take into account only the rows of the matrices T and F that are marked by ones in the slice X.
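A small Python model may help to see what TMERGE and HIT compute. It rests on assumptions: the second matrix is called F and the mask slice X as in TMERGE, the helper names `tmerge` and `hit` are illustrative, and HIT is taken to mark a row when the two matrices agree on it and the mask bit is '1'. The loops are sequential stand-ins for the parallel column operations.

```python
from typing import List

Table = List[List[int]]


def tmerge(T: Table, X: List[int], F: Table) -> Table:
    """Return F with the rows marked by ones in X replaced by the same rows of T."""
    from_t = [[bit & x for bit in row] for row, x in zip(T, X)]
    from_f = [[bit & (1 - x) for bit in row] for row, x in zip(F, X)]
    return [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(from_t, from_f)]


def hit(T: Table, F: Table, X: List[int]) -> List[int]:
    """Return a slice marking the rows that are identical in T and F and selected by X."""
    # Compare corresponding columns: a '1' marks a position where T and F differ.
    difference = [[a ^ b for a, b in zip(rt, rf)] for rt, rf in zip(T, F)]
    # Disjunction of the bits in every row: '1' means the row differs somewhere.
    row_differs = [int(any(row)) for row in difference]
    # Keep only identical rows that are also marked by ones in X.
    return [x & (1 - d) for x, d in zip(X, row_differs)]


if __name__ == "__main__":
    T = [[1, 0], [0, 1], [0, 0]]
    F = [[1, 0], [1, 1], [1, 1]]
    print(tmerge(T, [0, 1, 0], F))
    print(hit(T, F, [1, 1, 1]))
```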
On the STAR-machine, each of these procedures takes O(k) time [9, 11], where k is the number of bit columns in the corresponding matrix. It should be noted that other basic procedures for nonnumerical processing can also be implemented on the AGP model. Therefore, a group of associative parallel algorithms that employ such basic procedures can be implemented on this model. In particular, this group includes associative versions of Kruskal's algorithm and the Prim-Dijkstra algorithm for finding the minimal spanning tree [13], the associative parallel algorithm for finding a fundamental set of circuits [14], and associative parallel algorithms for dynamic edge update of minimum spanning trees [9].
An Implementation of the AGP Model
The Associative Graph Processor architecture is shown in Fig. 1, where l is the coordinate of the processed bit slice, according to a search function loaded into TA. The number of basic logic searches generated by the SFG is 6, but in general this set can be extended to 16, depending on the given application [1].
In the AGP model, data processing is organized in bit-serial word-parallel mode. The typical single-comparand register is replaced by a Comparand Array, and the tag memory is extended to the size of the Cartesian product of the data and comparand sets. Associative processing is performed exclusively in the two-dimensional Tag Array. In each consecutive step of processing, the states of all PEs (TA cells) are updated on the basis of their previous states, the corresponding data/comparand values, and the search type selected by the SFG. The type of the search determines the initial state of TA. Results of comparisons are stored in the Tag Array and must be further processed in order to extract the global search output. Therefore, the Tag Array has a built-in capability of performing a fully parallel (bit-parallel word-parallel) exact match/mismatch search with a single comparand C2 and search mask SM2; the results are stored in T2.
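The bit-serial update of the Tag Array can be illustrated with a simplified Python model. It is a sketch under assumptions, not the actual SFG logic: only the exact match and mismatch searches are modelled, the update rules and the function name `tag_array_search` are illustrative, and the two inner loops stand for TA cells that would work in parallel.

```python
from typing import List


def tag_array_search(data: List[List[int]],
                     comparands: List[List[int]],
                     search: str = "=") -> List[List[int]]:
    """Return the Tag Array after a bit-serial search of every data word
    against every comparand; ta[i][j] refers to data word i and comparand j."""
    if search not in ("=", "!="):
        raise ValueError("only '=' and '!=' are modelled in this sketch")
    word_length = len(data[0])
    # The search type determines the initial state of every TA cell.
    initial_state = 1 if search == "=" else 0
    ta = [[initial_state for _ in comparands] for _ in data]
    for l in range(word_length):             # one processing step per bit slice l
        for i, word in enumerate(data):      # all cells are updated "simultaneously"
            for j, comparand in enumerate(comparands):
                differ = word[l] ^ comparand[l]
                if search == "=":
                    ta[i][j] &= 1 - differ   # stays 1 only while all bits agree
                else:
                    ta[i][j] |= differ       # becomes 1 once any bit disagrees
    return ta


if __name__ == "__main__":
    data = [[1, 0, 1], [0, 1, 1]]
    comparands = [[1, 0, 1], [0, 0, 0]]
    print(tag_array_search(data, comparands, "="))
    print(tag_array_search(data, comparands, "!="))
```

Each cell keeps a single bit of state between steps, so the number of steps depends on the word length only and not on the number of data words or comparands.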
Depending on the particular application, it may be necessary to add a number of optional components to the basic architecture of the AGP, such as a counter of responders, a select-first circuit, generators of combinatorial configurations, controllers for sequencing processor operations in TA, and extra memory registers for specific purposes.
The model is a parametrized generalization of many other simpler models proposed earlier and can substitute them functionally, i.e. process all algorithms derived for those models. For instance, models of single-comparand processors are equivalent to our model with the parameter m=1. In many cases, where multi-comparand searching is crucial, the AGP model provides a significant improvement of algorithms' performance.
In [7], the FPGA implementation details of a very similar associative processor with extensive search capabilities are reported. The design was carried out with Xilinx Foundation Series software using a low-cost Xess XS40 demo board with a single Xilinx XC4005XL FPGA device. The device contains 196 CLBs and 112 IOBs. Only six logic searches were implemented, i.e. =, ≠, <, ≤, >, and ≥. The maximum size of the prototype obtained was 6×6×4. In that case, 36 simultaneous comparisons were performed with a clock period of 68 ns per bit (maximum frequency 14.7 MHz). The asymptotic hardware complexity of the FPGA device is O(n^2) (it is dominated by the size of the Tag Array).
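The reported figures can be cross-checked with a short calculation. The sketch assumes that the prototype size denotes 6 data words, 6 comparands and 4-bit words, which the text does not spell out; the time for a complete word search is derived from that assumption and is not a measured value.

```python
CLOCK_PERIOD_NS = 68  # reported clock period per processed bit

# Assumed reading of the prototype size: data words, comparands, bits per word.
DATA_WORDS, COMPARANDS, WORD_BITS = 6, 6, 4

# A period of 1 ns corresponds to 1000 MHz, so the frequency in MHz is 1000 / period.
max_frequency_mhz = 1_000 / CLOCK_PERIOD_NS

# One Tag Array cell per (data word, comparand) pair.
simultaneous_comparisons = DATA_WORDS * COMPARANDS

# Bit-serial processing needs one clock period per bit of the word (assumption).
word_search_time_ns = WORD_BITS * CLOCK_PERIOD_NS

if __name__ == "__main__":
    print(f"{max_frequency_mhz:.1f} MHz")
    print(simultaneous_comparisons, "comparisons in", word_search_time_ns, "ns")
```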
We hope that the implementation data of this device closely approximate an AGP implementation. It should be noted that the practical size of the associative graph processor depends strongly on the graph problem at hand.
Conclusions
In this paper, a model of an associative graph processor has been proposed which enables highly parallel search operations. We have also presented a group of basic procedures used for implementing associative parallel algorithms. The AGP model is versatile and flexible enough to meet a wide range of requirements and to solve combinatorial problems from various complexity classes.
In contrast to other models of associative processors, the AGP model provides faster and more versatile parallel search operations and a number of additional operations on two-dimensional matrices that represent input data. It can emulate many simpler associative models used so far.
We intend to continue work on the development of the processor and to study new features useful in associative graph processing. We are also planning to extend its application domain and to design new efficient associative algorithms using multi-comparand search operations.
