Abstract. The multi-comparand associative search paradigm is shown to be efficient in processing complex search problems from many application areas including computational geometry, graph theory, and list/matrix computations. In this paper the first FPGA implementation of a small multi-comparand multi-search associative processor is reported. The architecture of the processor and its functions are described in detail. The processor works in a combined bit-serial/bit-parallel mode. Its main component is a multi-comparand associative memory with up to 16 programmable prescription functions (logic searches). Parameters of implemented FPGA devices are presented and discussed.
Introduction
Searching is one of the key concepts in computer science and engineering [12]. Search problems arise in many application areas. As the number and size of search tasks grow, time requirements become more critical. Therefore, new techniques that can speed up search operations are required. One solution is provided by massively parallel computing.
Among the many parallel models, associative machines, which belong to the broader SIMD category of parallel processors, are particularly well suited for performing fast parallel search operations. In content-addressing mode, associative memories use a built-in multiple comparison capability [2, 3, 20, 22]. Recent advances in associative processing include: classification and forwarding network search engines [27, 28], the massively parallel associative processor IXM2 [6], an experimental optical associative architecture [16], new applications in databases, computational geometry and artificial intelligence [11], new selection and extreme search algorithms for fully parallel memories resulting from average-case analysis [17], multiassociative/hybrid search [4], etc.
Most associative machines work in bit-serial word-parallel mode, with a single comparand and multiple data. Results of multiple comparisons are stored in a tag memory and are resolved by specialized circuitry (a logical tag resolver). However, many alternative paradigms have also been developed. Other commonly used working modes include fully parallel (bit-parallel word-parallel), word-serial, block-oriented, etc. Various machines implement different sets of basic logical matching functions. There exist architectures without tag memory, and solutions with separate tag and word mask circuitry. Logical tag resolvers often vary in their function and structure. All this leads to a variety of designs that fit many particular processing requirements.
One common drawback of mainstream associative models is the use of a single comparand only. Although there exist particular applications where simultaneous many-to-many comparisons are feasible without any need for an external comparand vector [8, 9], the common practice is to process many-to-many comparisons sequentially, by performing a sequence of one-to-many compare operations. Alternatively, parallelization is achieved by processing many copies of the input data simultaneously. A processor with multiple-comparand processing capability would make multi-comparand associative search much faster. The problem was stated for the first time in [1], where a versatile search memory with many-to-many comparison was proposed. Recently, associative algorithms with multi-comparand search operations were developed for six representative problems from various complexity classes, i.e. SET INCLUSION, GEOMETRIC RANGE SEARCH, MATRIX INCLUSION, NEIGHBOURHOOD, MULTIPLE SEARCH and MULTI-LIST RANKING [14].
At first the idea of multi-comparand search was rejected due to a fundamental misunderstanding (see [3], pp. 198-202). The high cost of the required hardware seemed to be the second obstacle in developing new associative architectures.
Progress in IC technology has resulted in successful ASIC designs of conventional single-comparand associative processors with extended search operations and memory cells designed at the transistor level [5, 7, 18, 21]. Single-comparand memories became embedded features of Spartan-II and Virtex programmable FPGA devices from Xilinx. New FPGA-based associative memory designs "XAPP 20x" were presented in recent years [23, 24, 25, 26]. However, to the best of our knowledge, no attempt to build a multi-comparand associative processor using ASIC/FPGA technology has been reported.
In this paper a low-cost experimental FPGA design of a simple hierarchical associative processor that implements the idea of multi-comparand search [14, 19] is reported. A multiple-comparand associative memory with 2D tag memory is the key module of the processor. The memory implements up to 16 different logic search functions. In order to extract global features of the processed data, associative search capability is introduced into the 2D tag memory too. Thus, the associative search is organized hierarchically, on two levels.
The design was performed with Xilinx Foundation Series Software (Student Edition v.2.1). We used a low-cost XS40 demo board from XESS with a single Xilinx XC4005XL FPGA device. The obtained results are very promising, bearing in mind the steady effort to take advantage of associative processing implemented in FPGA technology [23, 24, 25, 26].
In the next section the processor architecture is described. Section 3 is devoted to the associative memory design at the logic level. In Section 4 implementation data of small FPGA prototypes are collected. Section 5 suggests possible directions of further research.
The Multi-comparand Processor Architecture
The multi-comparand associative memory may be described as follows. Basic data processing is organized in bit-serial word-parallel mode, but the conventional single-comparand register is replaced by a comparand array, and the tag memory is extended to the size of the Cartesian product of the data and comparand sets. Processing is performed neither in the data array nor in the comparand array, but exclusively in the 2D tag memory. In each consecutive processing step, each tag memory cell is updated according to its previous state, the corresponding data and comparand values, and a so-called prescription function that defines the type of the search (24 logical and 10 arithmetical searches are valid [1]). The type of the search also determines the initial state of the tag memory. The possibility of defining a prescription function for each tag cell separately is a source of the memory's versatility, but it requires extra logic and log₂ q control bits for each cell, where q is the number of different prescription functions. Obviously, if the cell functions are assumed to be homogeneous, they can be handled in a less hardware-consuming way by reducing the number of control bits. With current technology, the processor's tag memory can be implemented as reconfigurable logic, with configuration data loaded from a PRAM, providing substantial hardware savings and an essential improvement in timing parameters. Comparison results collected in the 2D tag memory have to be processed further in order to extract global search results; otherwise, the partial intermediate data collected in the tag memory are difficult to interpret. This implies the need to provide the 2D tag memory with basic associative processing capabilities (exact match/mismatch). As a result, a simple multi-comparand multi-search associative processor architecture is obtained, which is depicted in Fig. 1.
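The bit-serial, word-parallel update of the 2D tag memory can be illustrated with a short software sketch. It is a simplified model under stated assumptions, not the hardware design itself: bit slices are processed MSB first, each tag cell keeps two hypothetical state bits (`eq`, `lt`), only the six basic searches are modelled, and a single search type is applied to all cells. The names `SEARCHES` and `multi_comparand_search` are introduced here for illustration only.

```python
from typing import Callable, Dict, List

# Assumed per-cell state: eq = "all bits seen so far are equal",
# lt = "the data word has been decided to be smaller than the comparand".
SEARCHES: Dict[str, Callable[[int, int], int]] = {
    "eq": lambda eq, lt: eq,
    "ne": lambda eq, lt: 1 - eq,
    "lt": lambda eq, lt: lt,
    "gt": lambda eq, lt: int(not eq and not lt),
    "le": lambda eq, lt: int(eq or lt),
    "ge": lambda eq, lt: 1 - lt,
}


def multi_comparand_search(data: List[int], comparands: List[int],
                           p: int, search: str = "eq") -> List[List[int]]:
    """Return the n x m tag matrix: tags[j][k] = 1 iff data[j] <search> comparands[k]."""
    n, m = len(data), len(comparands)
    eq = [[1] * m for _ in range(n)]   # initial tag state: all words equal so far
    lt = [[0] * m for _ in range(n)]
    # Only this loop over the p bit slices is sequential in the hardware model;
    # the two inner loops stand for the n x m tag cells working in parallel.
    for i in reversed(range(p)):       # MSB first (an assumption of this sketch)
        for j in range(n):
            d = (data[j] >> i) & 1
            for k in range(m):
                c = (comparands[k] >> i) & 1
                if eq[j][k] and d != c:        # first differing bit decides the order
                    eq[j][k] = 0
                    lt[j][k] = int(d < c)
    f = SEARCHES[search]
    return [[f(eq[j][k], lt[j][k]) for k in range(m)] for j in range(n)]


if __name__ == "__main__":
    # Three 3-bit data words compared with two comparands in p = 3 steps.
    print(multi_comparand_search([5, 3, 7], [3, 5], p=3, search="eq"))
    # -> [[0, 1], [1, 0], [0, 0]]
```

In this model all n × m comparisons complete after p bit-slice steps, regardless of the number of comparands.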
The processor consists of the following components [14]: to a prescription function loaded into TM from PRAM PFG; TM has a built-in capability of fully parallel (bit-parallel word-parallel) associative processing of its content with a single comparand C2 and search mask SM2 - results of the single-comparand EQUALITY search in TM are stored in T2; TM can also perform the SELECT FIRST function [3] in each column); PRAM PFG - PRAM prescription function generator (for our needs the number of prescription functions is 6, 12 or 16, but in general this set can be freely modelled for a given application); C2 - 1 × m binary comparand vector; SM2 - 1 × m TM mask vector; T2 - n × 1 binary tag vector for TM; LTR - logical tag resolver (in particular, it computes binary variables w and u: w = 1 iff at least one unmasked bit of T2 is equal to 1, while u = 1 iff all unmasked bits of T2 are equal to 1; variable w, often denoted as SOME/NONE, is commonly used in tag resolvers [3, 11]; variable u, which can be denoted as ALL/NOT ALL, is used in [11]).
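The second search level (EQUALITY search in TM with C2 and SM2, followed by the tag resolver) can be sketched as below. This is a minimal software model, not the circuit: the function names are hypothetical, a mask bit equal to 1 is assumed to mean "this position takes part in the search", and the optional mask on T2 is likewise an assumption.

```python
from typing import List, Optional, Tuple


def tm_equality_search(tm: List[List[int]], c2: List[int], sm2: List[int]) -> List[int]:
    """T2[j] = 1 iff row j of TM equals C2 on every position k with SM2[k] == 1."""
    return [int(all(row[k] == c2[k] for k in range(len(c2)) if sm2[k]))
            for row in tm]


def resolve_tags(t2: List[int], mask: Optional[List[int]] = None) -> Tuple[int, int]:
    """Return (w, u): w = SOME/NONE, u = ALL/NOT ALL over the unmasked bits of T2."""
    if mask is None:
        mask = [1] * len(t2)
    active = [t for t, keep in zip(t2, mask) if keep]
    w = int(any(active))
    u = int(all(active))   # vacuously 1 when every bit is masked out
    return w, u


if __name__ == "__main__":
    # A 3 x 2 tag matrix, e.g. the result of an equality multi-search:
    # row j holds the match bits of data word j against two comparands.
    tm = [[0, 1], [1, 0], [0, 0]]
    # Which data words matched no comparand at all? Search TM for the row 0 0.
    t2 = tm_equality_search(tm, c2=[0, 0], sm2=[1, 1])
    print(t2, resolve_tags(t2))   # -> [0, 0, 1] (1, 0)
```

Here w = 1 signals that at least one data word has no equal comparand, so a global property of the comparison results is read from two bits instead of the whole tag matrix.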
The above processor architecture is described in its basic configuration. Depending on the particular application, it may be necessary to add some of the following optional architecture components:
COUNTER OF RESPONDERS - counter of the number of responders for T2 [3]; SELECT FIRST - circuit for each column of TM and T2 [3]; (p,k)-COMBINATION GENERATOR - hardware mask generator for SM1 (and, possibly, a similar one for SM2) [13]; (p,k)-PERMUTATION GENERATOR - hardware comparand generator placed between DA and CA [13]; CONTROLLERS - for sequencing processor operations in TM (required for processing some specific problems only); MEMORY REGISTERS - storage of temporary results of computations, like tag vectors, mask vectors, etc.
The machine model is powerful and flexible enough to satisfy numerous requirements in fast associative processing. The model is a parametrized generalization of many simpler models proposed earlier and can substitute for them functionally, i.e. it can process all algorithms derived for those models. For instance, models of single-comparand processors are equivalent to our model with the parameter m = 1. In many cases where multi-comparand searching is crucial, the model provides a significant improvement of the algorithms' performance.
The proposed architecture satisfies the following design assumptions which can be used for evaluation of combinatorial algorithms developed in our model:
1. a single bit-serial search operation lasts a unit time;
2. all bit-serial word-parallel search operations last a time proportional to the number of bit slices, i.e. O(p) or less;
3. TM prescription functions can be loaded from PRAM PFG in O(1) time;
4. single mask programming is performed in O(1) time;
5. a single comparand permutation is performed in O(1) time;
6. LTR, SELECT FIRST circuits and the COUNTER OF RESPONDERS consume a time that depends on the construction of these components and the available technology. In general that time is a logarithmic function of n.
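Under these assumptions the gain over a single-comparand machine can be estimated with a very small cost model. The sketch below only counts unit-time bit-slice steps and ignores the O(1) loading terms and the tag-resolution time. The function name is hypothetical, and reading a 6×6×4 configuration as n = 6 data words, m = 6 comparands and p = 4 bits is an assumption made for the example.

```python
def bit_serial_steps(p: int, m: int, multi_comparand: bool) -> int:
    """Unit-time bit-slice steps needed to compare all data words with m comparands.

    The number of data words n does not appear because every search is
    word-parallel. A single-comparand machine repeats the p-step search
    once per comparand; the multi-comparand memory needs one p-step search.
    """
    return p if multi_comparand else m * p


if __name__ == "__main__":
    p, m = 4, 6
    print(bit_serial_steps(p, m, multi_comparand=False))  # -> 24
    print(bit_serial_steps(p, m, multi_comparand=True))   # -> 4
```

In this simple model the speed-up equals the number of comparands m.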
The Design
The processor architecture consists of data and comparand registers, associative memory TM, output register T2, specialized output circuits and a control unit. The architecture is easily scalable since its basic components have a very regular structure. Below we describe in detail the most important unit, i.e. the associative memory TM. Tag memory TM is built of identical cells designed for 16 different logic searches. We started by implementing 6 basic logic operations and gradually increased their number to all 16 possible searches. Two variants of the TM cell were considered: one with latches and the other with D flip-flops as storage elements. Each cell has 9 inputs: CM (mask), S0-S3 (search selector), DA (data bit), CA (comparand bit), WRITEEN (write enable) and CLEAR (clear); two state variables, P and Q (memory state); and an output O (search result).
The following notation is used: p - word size; i - bit position (0 ≤ i ≤ p-1); Q_i - state of memory element Q in the i-th cycle (
A description of a Tag Memory cell with full logic capabilities (16 logic searches) is presented in Table 1.
The FPGA Prototype
For prototyping, Xilinx Foundation Series Software (Student Edition v.2.1) and a demo board with a single Xilinx XC4005XL FPGA device, which is one of the smallest in the XC4000 series, were used. The XC4005XL device, with a density of 9k gates, contains 196 CLBs and 112 IOBs. The prototyping work had two aims. The first was to build a small multi-comparand processor with basic functionality. Only six logic searches were implemented, i.e. {=, ≠, <, >, ≤ and ≥}. The maximum size of the prototype obtained was 6×6×4. In that case 36 simultaneous comparisons were performed with a clock period of 68 ns per bit. Parameters of sample processor designs are shown in Table 2. Asymptotic circuit complexity is O(n²) (it is dominated by the TM size), while the clock rate decreases slower than linearly with n. Therefore the second aim was to build and test the TM array alone, since it consumes most of the circuit's CLBs. TM was built in three versions: with 6, 12 and 16 built-in logic operations. The maximum size of the prototype obtained was 5×5, and about 75% of device resources were used. Higher functionality is paid for with lower circuit speed, but the resulting TM array is still surprisingly fast. Designs with latches instead of D flip-flops are slightly faster and smaller in terms of circuit size. Parameters of sample TM array designs are shown in Table 3.
If an XC4085 FPGA were used (which has the highest density in the XC4000 series), with 4 times more IOBs and 16 times more CLBs, the estimated TM array size would be 10 × 10 due to I/O limitations, with 600 out of 3136 CLBs used (utilization about 19%).
All designs were carefully tested [19]. Small instances of the SET INCLUSION, GEOMETRIC RANGE SEARCH, MULTIPLE SEARCH and MATRIX INCLUSION problems were processed in our multi-comparand multi-search processor with the associative algorithms described in [14]. The analysis of waveforms confirmed the correctness of the designed devices.
Concluding Remarks
In this paper an FPGA implementation of a multi-comparand multi-search associative processor was reported, which enables highly parallel search operations. The processor is versatile and flexible enough to meet a wide range of requirements and to solve combinatorial problems belonging to various complexity classes, including polynomial, P-complete, isomorphic-complete and NP-complete problems.
It is not recommended to implement the Tag Memory array as a separate FPGA device. In high-density FPGAs the main bottleneck is the limited number of IOBs. Hence, one processor should be implemented in a single FPGA. The sizes of the necessary data array DA and comparand array CA need to be evaluated, and their read/write operations resolved, taking the existing limitations into account. Basically, all necessary additional components should be included within one chip. Sometimes, however, implementing some processor components as separate devices may be reasonable.
In cases when one processor is not sufficient for a given problem size, a number of processors may be applied using the known multiassociative/hybrid techniques described in [4]. Partitioning particular problem data and deriving global solutions from distributed partial results remains an important research problem.
Some particular topics may be objects of further investigation. For instance, establishing the relationships between desired associative memory cell structure and embedded macrocell functions may contribute to better utilization of FPGA hardware. Various techniques of speeding up search operations may still be discovered (see [17] ). Finally, a closer inspection of various classes of combinatorial problems should significantly extend the application domain as well as result in many new algorithms developed on the basis of multi-comparand search operations.
