Abstract: This paper introduces a new approach to the acceleration of nonnumeric, database, and information retrieval operations. While traditional techniques accelerate the most time-critical high-level software constructs, we propose novel low-level primitives and demonstrate how these primitives improve database operations. Radix sorting, hashing, and bit-vector operations are used to develop a new class of nonnumeric algorithms, OTHER (Ordered Table Hashing and Radix sort algorithms), based on the low-level hashing operations Init, Mark, and Scan. We propose and evaluate two hardware accelerators for OTHER algorithms. We show that low-complexity hardware support (less than 10K transistors) can significantly improve the performance of nonnumeric operations.
INTRODUCTION
CONVENTIONAL computer systems, by their very nature, are sequential machines, supported by an arithmetic logic unit structured for numeric computations and a passive address-accessible memory hierarchy. The ability to efficiently process large amounts of nonnumeric data is crucial for many computer applications, such as database and information retrieval. They are characterized by simple, repetitive nonnumeric operations on a massive volume of data where, in general, data locality is not preserved. This incompatibility has resulted in the so-called semantic gap, computation gap, and size gap [1].
The challenge to reduce the aforementioned gaps has motivated a great deal of research since the mid 1970s, e.g., database machines (DBM) [1] . Using the processor architecture and database functionality as classification taxonomies, one can distinguish three classes of database machines: application-specific DBMs, general-purpose DBMs, and general-purpose computers with increased database performance. However, it has been shown that suitable database performance can be easily achieved using VLSI accelerators [2] , [3] , [4] , [5] , [6] , [7] . The approach proposed in this paper is based on hardware acceleration of general purpose CPUs.
A great deal of research in the field of database machines has focused on the development of dedicated database architectures [1] , [8] , [9] , [10] . Parallel to this work, some vendors have developed general purpose machines with database functions such as select, search, and join integrated directly into the machine architecture [11] . In spite of the commercial success of database machines (hardware or software-based organizations), the generality of the von Neumann architecture has also motivated another approach to the efficient handling of database systems, i.e., machine-level instructional support for operations that can improve the performance of the database operations (e.g., bit string manipulation instructions). For example, Intel i386/486 processors support bit manipulation instructions [12] that can be used to implement primitive database operations [13] . Interestingly, the Teradata database machine also uses the very same set of operations to facilitate the efficient execution of database functions [14] . Our simulation results show that, even in the case of a manually optimized assembly program based on these dedicated instructions, acceleration of complex bit-manipulation operations is 1.5 to 3 times compared to an equivalent C program [13] . However, in practice, the performance improvement for database functions is much less.
Advances in VLSI technology would suggest an alternative approach to improve performance of the database functions, namely, VLSI accelerators. Most of the realized VLSI accelerators for the database environment are intended to accelerate high-level functions, such as select [9] , sort [3] , [6] , and join [1] . This paper proposes an alternative to this direction; simply put, instead of accelerating high-level operations, we accelerate "lower-level" operations (see Section 2.1). Conventional RISC instruction set optimization is generally based on statistics of (mostly numeric) benchmarks implemented using an existing instruction set. Therefore, in most cases, it represents optimization of the existing instruction set. Nevertheless, a different set of basic operations generates completely different execution statistics for the given application. Since the design of a dedicated instruction set can hardly be economically justified, we propose a standard, general-purpose processor core communicating with a low-level operation accelerator. The proposed Low Level Accelerator (LLA) can be implemented either as an on-chip resource or an off-chip coprocessor. Although the LLA concept is not new, to the best of our knowledge, it has not been used extensively for nonnumeric database operations. Order hashing primitives resolve the semantic gap by offering a mechanism to directly position every processed object in the approximate location in the final result set. Efficient processing of modified algorithms allows processing of larger data sets in shorter time, providing a cost-efficient solution for computation and size gap problems.
Section 2 addresses our general approach and its mathematical foundation and introduces our basic algorithms. Section 3 extends this set of algorithms with bit vector primitives for hardware acceleration. Analytical modeling of select, sort, and join database operations is discussed in Section 4. The design of an accelerator for the proposed primitive operations is introduced in Section 5. Section 6 presents the simulation results and, finally, Section 7 concludes the paper.
LOW-LEVEL ACCELERATION
This paper presents optimization of the cost/performance ratio of nonnumeric database operations based on the low-level ordered hashing primitives. The implementation hierarchy is presented in Table 1. Standard database operations are implemented using ordered table hashing operations that rely on low-level hashing primitives. In the modified algorithms, we observe some suboptimal low-level operations that offer a higher speedup through on-chip acceleration. The relative share of low-level table processing operations is up to 90 percent, as presented in Section 6 (Table 3). Consequently, we focus on an unorthodox research direction: introduce a new set of primitive operations, accelerate the most critical operations in this new set of primitive operations, synthesize high-level functions, and introduce modified algorithms for basic database functions based on this new set of primitive operations.
The proposed set of primitive operations is called OTHER (Ordered Table Hashing and Radix sort) since it is heavily based on a hashing technique for database operations. We implement the most frequently used database operations, select, sort, and join, using ordered table hashing, based on bit vector operations for effective table processing. The low, medium, and high-level operations are described in Sections 2.1, 2.2, and 2.3, respectively.
Low-Level Operations/Table Manipulation Primitives
We decided to develop the OTHER algorithms based on the order-preserving hashing technique because of the ability of hashing to reduce the search space and, hence, to derive more efficiency [15] . Commonly, the results of select database operations must be sorted. Therefore, instead of scattering the records across the hashing table, we use the order-hashing functions and the Address Calculation Sort Method [3] , [16] , [17] to order the records in a logical fashion in the hash table. Finally, the collision processing overhead among the duplicate and synonym keys (records with different key values, generating the same hash value) is reduced by using the logic identifier of each record. The logic identifier uniquely represents each record based on its relative position within the processed data set. This allows a unique correspondence between the record position and its position in collision sets [18] , [19] , [20] . Moreover, instead of maintaining a separate pointer in the list of collisions [3] , the logic identifier itself can be used as a pointer to the next member of the list of collisions. In the proposed algorithms, the logic identifier is implemented simply as a record counter during the initial processing phase. Table 2 summarizes the notation used in this section. Formally, assume we have an unordered set of N objects A
and an ordered hashing function $\vartheta$ generating values in a domain $D_2$ with maximum cardinality $M$,
The hash function $\vartheta$ preserves the order of objects in set $A$ such that, for every pair $\{a_i, a_j\}$ and a given relation $\rho$ between objects, we have
Let S be an ordered hashing table
and C be a set of synonyms
Members of sets $S$ and $C$ can be object identifiers or pointers to objects. We consider each record to be identified by its relative position within a block of records in the main memory. As a result, sets $S$ and $C$ can be realized as vectors of logic identifiers and, therefore, $s_i$ and $c_i$ are equivalent to $S[i]$ and $C[i]$, respectively. This approach allows efficient computer implementation using sort array $S$ and collision array $C$. For example, if the objects $a_i$ represent numbers,
The ordering process requires two basic operations for every processed object:
• Mark, to write the object identifier into the ordered hashing table $S$ and link synonyms with the same hash value using collision table $C$.
• Scan, to retrieve the next object identifier in the sorted order.
Successive calls of operation Scan generate a set of object identifiers with hash values between given limits in the sorted order:
We use three low-level primitive operations in the proposed middle and high-level algorithms:
• TableInit initializes the entries in the hash table by means of "null pointers" (Algorithm LL-1).
• Mark operation Mark($\vartheta$, a, i) performs hashing of the ith object ($a_i$) using hash function $\vartheta(a, i)$. Marking is a three-phase process: First, the hash value of the current partition of the key ($\vartheta(a_i)$) is calculated, then the logic identifier of the key ($i$) is written into the hash table $S$, and, finally, the logic identifier is linked into the list of synonyms $C$ (Algorithm LL-2). In our research, the hashing function is implemented in software to extract $\log M$ bits from the processed key. Additional acceleration and a smaller number of collisions can be achieved using a hardware-based hashing function [21]. A programmable hashing function can be easily integrated as a part of the accelerator.
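The three primitives can be illustrated with a minimal C sketch. It is not the paper's Algorithm LL-1 or LL-2: the table size, the `NIL` encoding of the null pointer, the 16-bit key width, and the helper `hash8` (which keeps the top 8 bits of the key as an order-preserving hash) are all assumptions made for illustration.

```c
#include <assert.h>

#define M 256          /* hash table size (assumed) */
#define N_MAX 1024     /* maximum number of records (assumed) */
#define NIL (-1)       /* "null pointer" logic identifier (assumed encoding) */

static int S[M];       /* ordered hash table: header of each synonym list */
static int C[N_MAX];   /* collision table: next member of the same list */

/* Order-preserving hash for 16-bit keys: keep the top 8 bits. */
static int hash8(unsigned key) { return (int)((key >> 8) & 0xFFu); }

/* TableInit: fill the hash table with null identifiers. */
void table_init(void) {
    for (int j = 0; j < M; j++) S[j] = NIL;
}

/* Mark: hash key a[i], write i into S, link the old header through C. */
void mark(const unsigned *a, int i) {
    assert(i >= 0 && i < N_MAX);
    int h = hash8(a[i]);
    C[i] = S[h];   /* previous header (or NIL) becomes the successor */
    S[h] = i;      /* i is the new header of this synonym list */
}

/* Scan: return the header stored in the next occupied table entry at or
   after *pos and not beyond high; returns NIL when none remain. */
int scan(int *pos, int high) {
    while (*pos <= high) {
        int j = (*pos)++;
        if (S[j] != NIL) return S[j];
    }
    return NIL;
}
```

Because the logic identifier doubles as the link, Mark needs no storage beyond the two arrays; the remaining synonyms of a header `h` are reached by following `C[h]` until `NIL`.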
Low-level primitives TableInit, Mark, and Scan perform ordered hashing of individual objects (database attribute values). Medium-level procedures comprise the low-level procedures and perform ordering of all objects in input set $A$. The most important problem of hash-based algorithms is resolution of synonyms [22]. Collisions are linked to comprise a set called the equivalence class or E-class.
Definition 1. The equivalence class or E-class $E_j$ is the collection of $a_i$ such that
where $\vartheta_j < \vartheta_{j+1}$, $1 \le j < q$, $1 \le q \le M$.
All members of the equivalence class must be processed to resolve collisions and make final order within the class. This is usually performed in software using algorithms with fast sorting of small sets, such as the List Insertion Sort method (LIS) [22] . The equivalence class consisting of only one object is called final and requires no further processing. Based on the key size and available main memory, the order-preserving hashing function can be applied to the whole key or a key partition for partial ordering. Similarly to radix sorting, the sorting starts from the most significant position of the key. Keys are sorted by recursive use of ordered hashing and the LIS sort. Partitions involved in radix sort are not always taken from the absolute start of the key; rather, we can start from the most distinctive part of the keys.
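Taking key partitions from the most significant position can be sketched as below. The function name `key_partition`, the 32-bit key width, and the restriction that the partition width divides 32 are assumptions for illustration; the paper also allows starting from the most distinctive part of the key, which this sketch does not cover.

```c
#include <assert.h>
#include <stdint.h>

/* Return the k-th partition (k = 1 is the most significant) of a 32-bit key
   split into partitions of `bits` bits each. `bits` must divide 32. */
uint32_t key_partition(uint32_t key, int k, int bits) {
    assert(bits > 0 && bits < 32 && 32 % bits == 0);
    assert(k >= 1 && k <= 32 / bits);
    int shift = 32 - k * bits;
    return (key >> shift) & ((UINT32_C(1) << bits) - 1u);
}
```

The first partition is a monotone function of the key, so using it as the hash value preserves order; later partitions only need to order keys that already agree on all earlier partitions.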
The efficiency of hash-based algorithms critically depends on the quality of the hash function and the number of collisions it generates. The worst-case performance, when all the keys are hashed into the same table entry, traditionally creates fear of hashing. It has been shown that the hash function can achieve analytical performance with real-life data [21]. If the hash function generates all hash values with equal probability, then the number of collisions can be approximated using the Poisson distribution [23]. The number of single-member E-classes (final classes without collisions) $N_f$ after processing $N$ keys using a table of size $M$ is
while the number of occupied entries in the hash table $N_m$ will be
According to (7) and (8), when a large hash table is used ($M \gg N$), almost all E-classes are final ($N_f \approx N$). This means that the result of the initial sorting phase contains a sorted list with a small number of collisions, requiring further sorting. On the other hand, for small hash tables ($M \ll N$), $N_f$ is small and the average problem size is reduced from $N$ to $N/M$. This is particularly important for operations with nonlinear execution time, such as sorting (with $O(N \log N)$ complexity). In conclusion, a large hash table decreases problem size and collision processing time at the expense of increased table processing time and larger memory requirements.
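For reference, the standard Poisson approximation for $N$ keys hashed uniformly into $M$ entries gives the following estimates. They are consistent with the two limiting cases discussed above; that they match the paper's (7) and (8) term by term is an assumption.

```latex
% Load factor and Poisson occupancy probability of one table entry
\lambda = \frac{N}{M}, \qquad
P(\text{entry holds exactly } k \text{ keys}) \approx \frac{\lambda^{k} e^{-\lambda}}{k!}

% Entries holding exactly one key (final E-classes)
N_f \approx M \lambda e^{-\lambda} = N e^{-N/M}

% Entries holding at least one key (occupied entries)
N_m \approx M \left(1 - e^{-N/M}\right)

% Limits: M \gg N gives N_f \to N and N_m \to N;
%         M \ll N gives N_f \to 0 and N_m \to M.
```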
OTHER algorithms link members of E-classes using the logic identifier of each record. In this way (similar to the techniques proposed in [22] , [24] ), there is no need to maintain additional pointers. Consequently, according to Algorithm LL-2, the equivalence class could be defined recursively as
The first element of $E_j$ ($S[j]$) is called the header. This is the only element stored in hash table $S$ and it is used to access all the other members of the E-class stored and linked in the collision table. According to Definition 1 and (3), the set of the equivalence classes is also ordered. Therefore, let us define an ordering function $F$ that generates an ordered set of E-classes:
If the hashing function cannot be applied to the whole key value, the key values can be partitioned and the aforementioned hash-based sort algorithm can be applied recursively. Formally, let us assume the following partitioning of key $a_i$ into $L$ partitions:
When partitioning is used, a low-level Mark operation is applied to the current key partition: Mark($\vartheta$, a, i, k) performs hashing on the kth partition of the ith key ($a_{ik}$) using hash function $\vartheta$. Ordering function $F$ is applied recursively:
It can be seen that, after $l$ partitions, even for a small hash table, the average problem size is reduced at least from $N$ to $n_l$, where $n_l$ is the average length of the list of synonym keys. After $l$ partitions, the average E-class length $n_l$ is equal to
In the case of sorting, according to (13), the complexity of the algorithm $O(N \log N)$ is decreased to $O(N) + O(n_l \log n_l)$, where $n_l$ is the average E-class length after processing $l$ partitions. In this case, even a suboptimal algorithm, such as List Insertion Sort (LIS) [22], can be used to sort synonym keys. Our measurements on the DEC Alpha workstation indicate that the LIS is the most efficient solution for sorting E-classes of up to 27 collisions.
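Sorting a short E-class by list insertion can be sketched as follows. This is a generic list insertion sort over a linked list of logic identifiers, not the implementation measured in the paper; the function name `lis_sort`, the `NIL` terminator, and the array `next` standing in for the collision table are assumptions.

```c
#include <assert.h>

#define NIL (-1)

/* Sort one E-class in ascending key order by list insertion.
   `header` is the first logic identifier, next[] plays the role of the
   collision table C, key[] holds the keys. Returns the new header. */
int lis_sort(int header, int *next, const unsigned *key) {
    assert(next && key);
    int sorted = NIL;                 /* head of the sorted list */
    int i = header;
    while (i != NIL) {
        int rest = next[i];           /* remember the unsorted remainder */
        if (sorted == NIL || key[i] < key[sorted]) {
            next[i] = sorted;         /* insert in front */
            sorted = i;
        } else {
            int p = sorted;           /* last node whose key is <= key[i] */
            while (next[p] != NIL && key[next[p]] <= key[i]) p = next[p];
            next[i] = next[p];
            next[p] = i;
        }
        i = rest;
    }
    return sorted;
}
```

The cost is quadratic in the class length, which is acceptable only because ordered hashing has already reduced each class to a handful of synonyms.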
Algorithm ML-1 presents the general E-class processing procedure. It orders all members of the E-class with the kth key partition between $LowLimit_k$ and $HighLimit_k$. This is the major medium-level building block for the OTHER algorithms. The result of the Scan operation (set $\Psi$) is pushed on the stack for further processing. For example, the ordered select algorithm (Algorithm ML-2) will repeatedly call on the E-class processing. The ordered select procedure is composed of three major blocks:
• The initial loading performs the highest level ordering and creates an initial set of E-classes from the original set, $F(A, \vartheta(a_{i1}))$.
• The second phase orders the set of E-classes satisfying the selection criterion. For optimal acceleration, only $L_{opt}$ partitions are processed and single-member E-classes are placed directly on the result stack. We can represent this as $F(A, \vartheta(a_{il}))$, $2 \le l \le L_{opt}$.
• Finally, in the third phase, the remaining short lists are processed online during the creation of the final result.
Algorithm ML-1. E-class processing
int Process_E_class(Header, k)
{
    /* Initialize the hashing table to process kth partition of key */
    TableInit();
    /* Start from the E-class header */
    i = Header;
    do {
        /* Hash and link kth partition of ith key */
        Mark(ϑ, a, i, k);
        /* Get the next member of E-class, linked in the collision table
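A minimal sketch of how E-class processing might drive a complete sort is given below. It is an illustration under stated assumptions, not the paper's Algorithm ML-1: keys are 16 bits wide and split into two 8-bit partitions, the names `process_e_class`, `other_sort`, `part`, and the explicit stack arrays are hypothetical, and classes that still collide after the last partition are treated as duplicates instead of being passed to LIS.

```c
#include <assert.h>

#define M 256
#define N_MAX 1024
#define NIL (-1)
#define NPART 2        /* two 8-bit partitions of a 16-bit key (assumed) */

static int S[M];       /* ordered hash table */
static int C[N_MAX];   /* collision table */
static int stack_hdr[N_MAX], stack_lvl[N_MAX], top = 0;

/* k-th partition of the key, k = 1 being the most significant. */
static int part(unsigned key, int k) {
    return (int)((key >> (8 * (NPART - k))) & 0xFFu);
}

/* Order one E-class on partition k and push the resulting E-classes. */
static void process_e_class(const unsigned *a, int header, int k) {
    for (int j = 0; j < M; j++) S[j] = NIL;        /* TableInit */
    for (int i = header; i != NIL; ) {             /* walk the E-class */
        int next = C[i];                           /* read the link before Mark rewrites it */
        int h = part(a[i], k);                     /* Mark */
        C[i] = S[h];
        S[h] = i;
        i = next;
    }
    for (int j = M - 1; j >= 0; j--)               /* Scan, highest first, so pops ascend */
        if (S[j] != NIL) { stack_hdr[top] = S[j]; stack_lvl[top] = k; top++; }
}

/* Sort n keys; out[] receives logic identifiers in ascending key order. */
int other_sort(const unsigned *a, int n, int *out) {
    assert(n > 0 && n <= N_MAX);
    int m = 0;
    for (int i = 0; i < n; i++) C[i] = (i + 1 < n) ? i + 1 : NIL;  /* initial list */
    top = 0;
    process_e_class(a, 0, 1);
    while (top > 0) {
        top--;
        int hdr = stack_hdr[top], k = stack_lvl[top];
        if (C[hdr] == NIL || k == NPART) {         /* final class, or only duplicates left */
            for (int i = hdr; i != NIL; i = C[i]) out[m++] = i;
        } else {
            process_e_class(a, hdr, k + 1);        /* refine on the next partition */
        }
    }
    return m;
}
```

Pushing the scanned classes in descending order means the stack yields them in ascending order, and the subclasses of a refined class are popped before any class with a larger hash value, so records are emitted in sorted order without moving them.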
Database Operations
We present implementation of the most frequently used database operations using ordered table hashing operations. Without loss of generality, we assume a relational database model [25] . For example, consider the relation schema:
Employee(EmployeeID, FirstName, LastName, HireDate, Salary)
Invoice(Invoice, CustomerID, EmployeeID, Date, Amount)
Database queries are declared using the standard query language SQL [26]. We provide here examples of select, multiple select, sort, and join database operations. The select database operation selects a set of objects satisfying given conditions. Very often, the output of the select operation must be ordered on the value of some attribute. For example, selecting and sorting all employees hired after 1 January 1999 could be performed with the following SQL query:
Sometimes select must satisfy multiple criteria. In the first example, Query 1 can be changed to find all employees hired after 1 January 1999 with a salary less than $45,000.00. In that case, the WHERE clause will be changed as follows:
WHERE HireDate > #1-JAN-1999# AND Salary<45,000.00
Another important database operation is joining information from two relations. This operation is called join. For example, Query 2 represents selection of all invoices made by employee Smith, ordered by date. A search for a specific record is accelerated using an index, although use of an index introduces an additional processing overhead. We use an ordered table hashing to select and sort fields without building an index over attribute values.
The generality of the OTHER algorithms allows us to decompose complex database operations into a set of primitive operations. As indicated earlier, hashing is often used to perform this decomposition. For the remainder of this paper (for the sake of simplicity without loss of generality), we assume an ideal hashing function with uniform distribution of hash values that can be approximated by the Poisson distribution. It should be noted that, in the proposed algorithms, the single element E-classes do not require further processing. Therefore, it is desirable to generate as many single element E-classes as possible as early as possible during the course of operations. Moreover, note that, in the final stage, the proposed algorithms simply process many short lists of elements, regardless of the size of the original problem. In other words, our algorithms decompose the original problem into a set of smaller problems determined by the number of collisions made by the hashing function. Problem decomposition is significant for operations with nonlinear execution time, such as sorting.
The E-class processing algorithm can be used to synthesize most of the database operations and even introduce some additional optimization. We present the most important applications in the following subsections.
Select
The select database operation is performed as an ordered selection between limits, Select(LowLimit, HighLimit). Two types of selection operations are considered: 1) SelectUnique, where records with keys equal to a predefined value are selected, and 2) SelectRange, where a list of records with keys in a predefined range of values is generated. Naturally, conventional hash-based algorithms can be used to perform the SelectUnique function. For the SelectRange function, after the initialization and creation of the hash table, we select only the records with keys within the specified range during the scan phase. For example, Query 1 can be implemented using an ordered select (Algorithm ML-2) on the key HireDate as Select(#1-JAN-1999#, $\infty$).
Unfortunately, long keys require prohibitively large tables. Therefore, as mentioned earlier, keys are divided into L partitions and the above sequence of operations is performed, recursively, on generated E-classes within the generated range. The following results are the natural byproducts of the Select operation:
• The operation generates a sorted (partially sorted) result. This could accelerate the execution of the other operations (e.g., aggregation) in the query. In most queries (such as Query 1), it is a desirable feature.
• The result of a select operation can be generated by using a few key partitions.
• The same initial hash table can be used to generate the results for multiple select operations. For example, in the modified Query 1, we can apply an ordered selection Select(0, 45,000.00) on the result of Select(#1-JAN-1999#, $\infty$).
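The range selection described above can be sketched as a scan restricted to the table entries between the hash values of the two limits. This is an illustration, not Algorithm ML-2: it uses a single 8-bit partition of 16-bit keys, and the names `load`, `select_range`, and `hash8` are hypothetical.

```c
#include <assert.h>

#define M 256
#define N_MAX 1024
#define NIL (-1)

static int S[M];       /* ordered hash table */
static int C[N_MAX];   /* collision table */

/* Order-preserving hash for 16-bit keys: keep the top 8 bits. */
static int hash8(unsigned key) { return (int)((key >> 8) & 0xFFu); }

/* Load n keys into the ordered hash table (TableInit followed by Mark). */
void load(const unsigned *a, int n) {
    assert(n >= 0 && n <= N_MAX);
    for (int j = 0; j < M; j++) S[j] = NIL;
    for (int i = 0; i < n; i++) {
        int h = hash8(a[i]);
        C[i] = S[h];
        S[h] = i;
    }
}

/* SelectRange: collect identifiers of keys in [low, high]. Only table
   entries between hash8(low) and hash8(high) are scanned; the exact key
   test is needed only for the two boundary E-classes. */
int select_range(const unsigned *a, unsigned low, unsigned high, int *out) {
    int m = 0;
    int lo = hash8(low), hi = hash8(high);
    for (int j = lo; j <= hi; j++) {
        int boundary = (j == lo || j == hi);
        for (int i = S[j]; i != NIL; i = C[i])
            if (!boundary || (a[i] >= low && a[i] <= high))
                out[m++] = i;
    }
    return m;
}
```

The table built by `load` is not modified by `select_range`, so several selections with different limits can reuse it. The result is ordered by E-class; members of one E-class still need the partition refinement or LIS step to be fully sorted.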
Minimum/Maximum
The minimum and maximum values in the processed set are obtained by processing only the lowest/highest-order E-classes on every key partition, respectively. The modified E-class processing algorithm uses only the result of GetFirst(0) in ascending order for the minimum and GetFirst(M) in descending order for the maximum, for every key partition $a_{il}$, $1 \le l \le L$.
Sorting
Sorting is performed as an ordered selection in the whole domain of hash values $D_2$, Select(0, $M - 1$), on every key partition. Formally, it can be represented as $F(\mathrm{HireDate}, \vartheta)$. The sort operation can be accelerated in two ways:
• Accelerate sorting by generating the E-classes based on the most significant partition of the keys and then applying a traditional sort algorithm on the elements of each E-class, or
• Recursively apply the E-class algorithm on the members of the generated E-classes based on various key partitions.
Duplicate Elimination
If we execute E-class processing on all partitions of the key, all objects in the collision table will be duplicates. Therefore, set $\Psi$ retrieved in the last Scan operation represents the set of objects without duplicate values.
Join
A general $\Theta$ join, where $\Theta \in \{<, \le, =, \ge, >\}$, is performed as an ordered selection over both relations. Joining of the two relations $A$ and $B$ can be represented as $F(A \cup B, \vartheta)$.
EXTENDING THE OTHER ALGORITHMS USING BIT VECTOR PRIMITIVES
Bit vectors are often used to perform or accelerate nonnumeric operations [8], [27]. Furthermore, the literature has also addressed the application of bit vector operations in distributed systems as a means of reducing the communication cost [28]. As a consequence, the OTHER algorithms are extended by the application of bit vectors as a processing aid to improve performance. The set of values generated by the hashing function $\vartheta$ on objects $a_i$, $\vartheta(a_i) = \vartheta_i$, can be represented using a bit vector:
Elements of a bit vector can be accessed in two modes: address access and associative access. The basic OTHER algorithms can be extended by bit vectors and bit vector operations. For example, the hash table is assumed to have an associated bit vector of the same cardinality. Each entry in the hash table has a corresponding bit in the bit vector that represents its status (ith bit = 1 indicates that the ith entry in the hash table is occupied). This allows the entries in the hash table to be initialized and scanned rapidly.
The ability of bit vector operations to improve performance also allows one to process larger hash tables. This further reduces the number of collisions and the cardinality of collision sets and, hence, accelerates the proposed algorithms. Bit vectors can be organized and operated as a flat file or, alternatively, as a hierarchical structure. In the hierarchical structure, a W-ary tree structure is organized to allow fast associative access. This organization is of particular interest when one is dealing with large, sparse bit vectors.
Flat Bit Vector Organization
In this organization, a bit vector of $N$ bits is realized as a collection of $\lceil N/W \rceil$ words of length $W$. Each bit $b$ is then referenced by a word number $C_b = \lfloor b/W \rfloor$ and a relative displacement $p$ within $C_b$, where $p = b - (\lfloor b/W \rfloor \cdot W)$. The density of a bit vector is defined as $f_n = n/N$, where $n$ is the number of marked bits in a bit vector of length $N$. Associative access to the flat bit vector may require, in the worst case, $N_{acc} = C_b$ accesses. For a small $f_n$, the bit vector scan is inefficient. As a result, we introduce a hierarchical organization to represent and access the bit vector.
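The word number and displacement arithmetic of the flat organization can be sketched in C as follows. The vector length of 1,024 bits and the names `bv_mark` and `bv_scan` are assumptions for illustration; the scan is the plain software version that tests a non-empty word bit by bit.

```c
#include <assert.h>
#include <stdint.h>

#define W 32                       /* word length in bits */
#define NBITS 1024                 /* bit vector length N (assumed) */
#define NWORDS ((NBITS + W - 1) / W)

static uint32_t bv[NWORDS];

/* Mark bit b; returns its previous status (0 or 1). */
int bv_mark(unsigned b) {
    assert(b < NBITS);
    unsigned word = b / W;         /* C_b = floor(b / W) */
    unsigned p = b - word * W;     /* displacement within the word */
    int old = (int)((bv[word] >> p) & 1u);
    bv[word] |= UINT32_C(1) << p;
    return old;
}

/* Return the first marked bit at or after `from`, or -1 if none.
   Empty words are skipped whole; a non-empty word is tested bit by bit. */
int bv_scan(unsigned from) {
    for (unsigned word = from / W; word < NWORDS; word++) {
        if (bv[word] == 0) continue;
        unsigned start = (word == from / W) ? from % W : 0;
        for (unsigned p = start; p < W; p++)
            if ((bv[word] >> p) & 1u) return (int)(word * W + p);
    }
    return -1;
}
```

For a sparse vector most of the time is spent stepping over empty words, which is the inefficiency that motivates the hierarchical organization.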
Hierarchical Bit Vector Organization
In the case of sparse bit vectors, the hierarchical organization allows fast associative access. It is organized as a W-ary tree over the original bit vector. The height of the tree is defined as
The fast associative access comes at the expense of more memory space. The total amount of memory needed to represent the hierarchical structure is
And, the maximum number of memory references needed for an associative access is
It should be noted that the hierarchical organization requires $(L_{Max} - 1)$ more memory references during the Mark phase operation.
Fig. 1 depicts an example of the OTHER sorting process on a database where age is used as the key attribute. A hash . Finally, the application of our hashing technique on generated E-classes based on the second key partition will produce the final sorted result list.
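The hierarchical bit vector organization can be sketched with two levels: a single index word over 32 leaf words. The sizes and the names `hbv_mark`, `hbv_scan`, and `first_set` are assumptions for illustration. In software `first_set` is a loop; the proposed accelerator performs the equivalent word scan in hardware.

```c
#include <assert.h>
#include <stdint.h>

#define W 32
#define NWORDS 32                  /* leaf words: a 1,024-bit vector (assumed) */

static uint32_t leaf[NWORDS];      /* level 1: the bit vector itself */
static uint32_t index_word;        /* level 2: bit w set iff leaf[w] != 0 */

/* Position of the lowest set bit of x at or above `start`, or -1. */
static int first_set(uint32_t x, unsigned start) {
    for (unsigned p = start; p < W; p++)
        if ((x >> p) & 1u) return (int)p;
    return -1;
}

/* Mark bit b in the leaf and in the index: one extra update per extra level. */
void hbv_mark(unsigned b) {
    assert(b < W * NWORDS);
    leaf[b / W] |= UINT32_C(1) << (b % W);
    index_word  |= UINT32_C(1) << (b / W);
}

/* First marked bit at or after `from`, or -1 if none. */
int hbv_scan(unsigned from) {
    unsigned w = from / W;
    if (w >= NWORDS) return -1;
    int p = first_set(leaf[w], from % W);      /* rest of the current word */
    if (p >= 0) return (int)(w * W) + p;
    if (w + 1 >= NWORDS) return -1;
    int nw = first_set(index_word, w + 1);     /* jump to the next non-empty word */
    if (nw < 0) return -1;
    return nw * W + first_set(leaf[nw], 0);
}
```

The scan touches at most the current leaf word, the index word, and one further leaf word, regardless of how many empty words lie in between. Mark pays for this with the additional index update.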
Example

ANALYTICAL MODELING
Analytical modeling is performed to demonstrate the acceleration of database functions when the OTHER algorithms are employed. We assume that the basic primitives are implemented in hardware and the high-level operations are synthesized based on the table and bit-manipulation primitives defined earlier. The final result is a list of tuple identifiers (tid) without projection and physical relocation of tuples. The theoretical maximum acceleration of the proposed algorithms is given as "Zero time" Table Processing (ZTP) performance. It is the performance of an accelerator with infinite speed and private memory which does not require additional system bus cycles. Therefore, every read operation would find the result readily in the accelerator. This section discusses performance improvement of bit vector operations, as well as Select, Sort, and Join operations.
Bit Vector Operations
Fig. 2 gives possible acceleration of bit vector operations for the vector size of 1 Mbit. Operations include vector initialization, mark, and scan primitives. As can be seen, hierarchical bit vector organization is more efficient than flat organization if the table utilization is less than 1 percent. Fig. 3 presents performance improvement. Conventional processors perform bit-vector scans as a sequence of shift-and-test steps, even in the case of dedicated bit operation instructions (Intel's X86 and Motorola's 680X0 families). Therefore, a software algorithm efficiently skips the empty words of a bit vector, but must scan the whole selected word bit by bit. The proposed accelerator performs single-cycle bit scan operations within a processor word. As a consequence, accelerators are the most efficient for a moderate table load factor, when every word in the bit vector has only a few bits marked. As can be seen from Fig. 3, for both organizations, acceleration depends on hash table utilization. In addition, in both cases, certain values of load factors offer maximum acceleration.
Select
Fig. 4 shows possible acceleration of the SelectUnique for different key partition lengths ($M$) and the selectivity factor of 1 percent; key length is assumed to be 32 bits. As expected, performance of the select operation is heavily dependent on the selectivity factor. Even for a ZTP operation, performance improves only 20 to 40 percent relative to a traditional implementation.
Sort
• ... (15), the algorithm complexity is still $O(N) + O(n_l^2)$, where $n_l$ is the average E-class length.
• One can find an optimal table size for an underlying database that is directly dependent on cardinality ($N$).
• Traditionally, hash-based algorithms can suffer from table hot spots due to uneven distribution of hash values (data skew). As a result, longer E-classes can be generated. However, table-based preprocessing of the OTHER sorting algorithm determines exactly the length of every E-class. Therefore, the optimal sorting algorithm could be applied to sort collisions. Moreover, the low table processing overhead (see Table 1) will not significantly increase the processing time, even when all the keys are equal.
• An optimal order-preserving hashing function for different key value distributions can be found.
• The initial order of the keys plays a significant role in existing sorting algorithms, but it is not relevant to the performance of the OTHER sorting algorithm. However, performance of the LIS sort, which is used to sort collisions, is influenced by the initial order among the keys.
• As expected, the acceleration ratio grows as the cardinality of the database grows because of faster processing per key and diminishing influence of fixed table processing overhead in the case of the OTHER algorithm. It can be seen in Fig. 5 that theoretical acceleration increases from 10 ($N = 50$) to 45 ($N = 100{,}000$).
Join
As in traditional database management systems, one can perform either a hash-based join or a sort-merge join. Accelerated sort and hash operations accelerate the join operation of the OTHER algorithms. The range of acceleration is similar to the performance of sort operations.
ARCHITECTURE OF THE PROPOSED ACCELERATOR
Our analysis showed that the table processing operations (e.g., table initialization and table scan) in the OTHER algorithms consume significant amounts of processing time. For example, in the case of sorting, 40 to 95 percent of the processing time is due to the table processing operations, depending on the cardinality of the underlying databases.
As a result, we introduced the bit vector and bit vector operations and attempted to accelerate these operations. In order to accelerate the aforementioned primitives in a cost-efficient manner, we developed an accelerator that can be used either as an extension to the CPU or as a coprocessor.
Bit Manipulation Accelerator (BMA)
The BMA is used in conjunction with conventional random access memory; together, they simulate an associative memory. The proposed accelerator is designed to accelerate three primitives, TableInit, Mark, and GetNext, as discussed in Section 2.
As noted earlier, sparse bit vectors introduce overhead; as a result, we introduced a hierarchical tree organization for the bit vector. The hierarchical structure of the bit vector offers a fast Scan and a constant number of memory accesses per scan cycle; however, these advantages come at the expense of increased hardware complexity of the algorithm and a slower Mark primitive. Consequently, we developed two types of accelerators: The Flat Accelerator (FA) and the Hierarchical Accelerator (HA). Nevertheless, regardless of the implementation, the CPU communicates with the accelerator via the following set of registers, either as a set of I/O or as memory mapped ports:
• ControlRegister (CR) is used to initialize the accelerator to the specific mode of operation (Init, Mark, Scan) and to adjust the size of the bit vector.
• BitSetRegister (BSR) accepts the bit address as the argument of the Mark primitive.
• Bit Test Register (BTR) is a one-bit read-only register, containing the previous status of the cell which is marked during the Mark primitive.
• Scan Register (SR) is a read-only register containing the result of the Scan primitive (i.e., the address of the next marked cell).
The Flat Accelerator (FA): Fig. 6a shows the block diagram of the FA scheme. As mentioned before, a bit in the bit vector is referenced by the Word Address (WA) and the Bit Address (BA) within the word. Bit vector scan is performed sequentially; consequently, the address generator is realized as a counter. The Word Scan Register (WSR) generates the address of the marked bit within the selected word (BA) during the scan phase. The Scan Register SR then concatenates WA and BA, generating a unique logical address. The FA will not take over the system bus if there are more marked bits in the current index word.
In contrast to the CISC implementation of bit operations, the WSR generates the next BA in the current word during a single cycle. Moreover, the WSR automatically clears the marked bit in the current word to enable further scanning within the word. The FA can scan a range of bit addresses; this feature allows us to select the range of key values or a set of hashed values.
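The register interface and the FA scan behavior described above can be sketched in software. The following is a minimal behavioral model, not the authors' hardware design: the class name, the method names, the 32-bit word length, and the choice to clear bits in a working copy of the word (rather than in memory) are all assumptions made for illustration.

```python
class FlatAccelerator:
    """Behavioral sketch of the FA (hypothetical model, not the actual design)."""

    WORD_BITS = 32  # assumed word length

    def __init__(self, size_bits):
        # Init primitive: clear the bit vector (its size is set via CR in hardware).
        self.n_words = -(-size_bits // self.WORD_BITS)  # ceiling division
        self.words = [0] * self.n_words
        self.btr = 0  # Bit Test Register
        self.start_scan()

    def mark(self, bit_address):
        """Write to BSR: set the bit and latch its previous state in BTR."""
        wa, ba = divmod(bit_address, self.WORD_BITS)
        mask = 1 << ba
        self.btr = 1 if self.words[wa] & mask else 0
        self.words[wa] |= mask
        return self.btr

    def start_scan(self, lo=0, hi=None):
        """Select the range of bit addresses [lo, hi) to be scanned."""
        self.hi = self.n_words * self.WORD_BITS if hi is None else hi
        self.wa, ba = divmod(lo, self.WORD_BITS)
        word = self.words[self.wa] if self.wa < self.n_words else 0
        # WSR holds a working copy of the current word; bits below lo are masked.
        self.wsr = word & ~((1 << ba) - 1)

    def scan(self):
        """Read SR: address of the next marked bit, or None when exhausted."""
        while self.wsr == 0:
            self.wa += 1  # address generator is a simple counter
            if self.wa >= self.n_words or self.wa * self.WORD_BITS >= self.hi:
                return None
            self.wsr = self.words[self.wa]  # one memory access per new word
        ba = (self.wsr & -self.wsr).bit_length() - 1  # lowest marked bit
        self.wsr &= self.wsr - 1  # clear it so scanning continues in this word
        sr = self.wa * self.WORD_BITS + ba  # SR = WA concatenated with BA
        return sr if sr < self.hi else None
```

In this model, marking the keys 5, 70, and 33 and then scanning returns them in ascending order, and BTR reports a duplicate when a key is marked a second time. Memory is touched only when the scan moves to a new word, which mirrors the statement that the FA does not take over the bus while marked bits remain in the current word.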
The Hierarchical Accelerator (HA): In addition to the basic functionality of the FA, the HA incorporates a more complicated control logic, address generator, and the scanner to allow hierarchical bit vector operations. The block diagram of the HA is given in Fig. 6b . The scanner contains four index registers for current words in every level of hierarchy. The HA also contains four WSR registers (one for every level of the hierarchy) to allow rapid scanning through the hierarchical tree. For a 1Mbit vector and a word length of 32-bits, this functionality allows us to access a marked bit in fewer than six memory cycles.
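A software sketch may help make the four-level organization concrete. With 32-bit words, four levels cover 32^4 = 2^20 = 1M bits, so a marked bit is reached by reading one word per level. The class below is an illustrative model under that assumption; its names, the early exit in `mark`, and the `reads` counter are hypothetical and are not taken from the HA design.

```python
WORD_BITS = 32
LEVELS = 4  # 32**4 = 1,048,576 bits (assumed 1Mbit vector, 32-bit words)


class HierarchicalBitVector:
    """Illustrative model of a hierarchical bit vector (not the HA hardware)."""

    def __init__(self):
        # levels[0] is the root (1 word); levels[3] holds the 1M data bits.
        # A bit at level k is set when its child word at level k+1 is nonzero.
        self.levels = [[0] * (WORD_BITS ** k) for k in range(LEVELS)]
        self.reads = 0  # word reads performed by the last scan

    def mark(self, addr):
        """Set a bit and update summary bits; returns the bit's previous state."""
        prev = 0
        for k in range(LEVELS - 1, -1, -1):
            word, bit = divmod(addr, WORD_BITS)
            mask = 1 << bit
            already = bool(self.levels[k][word] & mask)
            if k == LEVELS - 1:
                prev = int(already)
            if already:
                break  # all ancestors are already marked
            self.levels[k][word] |= mask
            addr = word  # index of the summary bit one level up
        return prev

    def scan(self):
        """Yield marked addresses in ascending order, skipping empty subtrees."""
        def walk(k, word):
            w = self.levels[k][word]  # plays the role of the level-k WSR
            self.reads += 1
            while w:
                bit = (w & -w).bit_length() - 1
                w &= w - 1
                child = word * WORD_BITS + bit
                if k == LEVELS - 1:
                    yield child
                else:
                    yield from walk(k + 1, child)

        self.reads = 0
        yield from walk(0, 0)
```

The sketch also shows the trade-off noted earlier: `mark` may update up to four words instead of one, while `scan` reads only the words on paths that lead to marked bits instead of all 32,768 data words.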
Both accelerators were designed and simulated using the TANNER standard cell VLSI package. The complexity of the FA is 1,800 cells, and the HA requires 4,300 cells, in the case of a 32-bit accelerator [13] .
SIMULATION AND COMPARATIVE ANALYSIS
Simulation results for the software implementation of OTHER algorithms are given in Table 3 . The execution traces were collected on the DEC Alpha 500au workstation with the DEC Alpha 21164/500MHz processor and Unix 4.0d operating system. The performance of the proposed algorithm is compared with the optimized Quicksort algorithm tuned for execution on the target system [29] . Both algorithms sort pointers to the records rather than physically moving the records. The performance of the algorithm using a table of size M = 64K and M = 1M entries is reported. As the results show, OTHER algorithms achieve significant performance improvement, even when implemented in software. Moreover, table processing consumes 50-95 percent of the processing time in the proposed algorithms. Hence, efficient hardware support for table manipulation operations should significantly improve the performance of the proposed algorithms.
We developed a simulator to evaluate the overall performance of the proposed accelerator based on the operation mix of typical database applications, as reported in the literature. A MIPS-based superscalar CPU with two instructions per cycle is used as the underlying platform [30] . The accelerator could be implemented as an on-chip accelerator tightly coupled to the CPU or as an intelligent off-chip bus master. We simulated the on-chip accelerator using special read/write instructions for communication with the accelerator; all instructions that read the result register of the accelerator are blocked until the results are available. In addition, all subsequent instructions are also blocked. We believe that out-of-order execution, whenever possible, will further increase the performance of OTHER algorithms, allowing useful processing while the CPU is waiting for the result from the accelerator.
The simulator was also extended to compare and contrast the effectiveness of the proposed accelerator against an ideal High-Level Accelerator (HLA). We assumed that the HLA had infinite processing speed and internal memory. The processing time would be just a processor time to write the set of keys and read the result as a list of processed identifiers (tid). Theoretically, the maximum acceleration of the proposed table hashing algorithms was presented as "Zero time" Table Processing (ZTP) performance. It was the performance of the accelerator with infinite speed and private memory which does not require additional system bus cycles.
As anticipated, we found that the performance of the accelerated system lies between the software implementation and the ideal implementation of table operations (i.e., ZTP), as shown in Fig. 7 .
The performance depends on the table loading factor (N/M). If the optimal table size is chosen, stable performance improvement is achieved, as presented in Fig. 8 . Finally, as noted before, the HA organization offers a better
The simulated cache performance of the OTHER sorting algorithm is given in Table 4 . The execution traces were collected on the DEC Alpha 500au workstation with the DEC Alpha 21164/500MHz processor [31] and analyzed using the ATOM cache analysis tool [32] . The simulation conditions were the same as explained in Section 2. The data cache was simulated as an 8KB direct-mapped cache with 32-byte blocks. It is clear that the OTHER algorithm has a significantly lower number of data references per sorted key than the Quicksort algorithm. The best software performance is achieved when the table size is between N/2 and N, where N is the number of processed keys. The data cache miss ratio is similar to that for Quicksort, even with 4 to 8 times fewer data references.
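For readers who want to reproduce this kind of measurement on their own address traces, the cache configuration above (8KB, direct mapped, 32-byte blocks, hence 256 lines) can be modeled in a few lines. This is a generic sketch of a direct-mapped cache and is not the ATOM tool; the function name and its parameters are assumptions.

```python
def miss_ratio(addresses, cache_bytes=8192, block_bytes=32):
    """Miss ratio of a direct-mapped cache over a sequence of byte addresses."""
    n_lines = cache_bytes // block_bytes  # 256 lines for the default setup
    tags = [None] * n_lines  # block number currently held by each line
    misses = 0
    total = 0
    for address in addresses:
        total += 1
        block = address // block_bytes
        index = block % n_lines  # direct mapping: each block has one possible line
        if tags[index] != block:
            tags[index] = block  # replace the resident block on a miss
            misses += 1
    return misses / total if total else 0.0
```

A sequential sweep of 4-byte words misses once per 32-byte block (ratio 0.125), while two addresses exactly 8KB apart evict each other on every access (ratio 1.0), which illustrates why the spread of table accesses matters for a direct-mapped cache.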
The effect of cache memory on the performance of the proposed accelerator architecture model was also simulated. We assumed separate two-level instruction and data caches, with access times of one and four processor cycles, respectively. Our simulator also assumed a main memory access time of 16 processor cycles. Finally, it was assumed that all instructions are fetched from the first-level cache. This assumption was mainly due to the relatively small size and repetitive nature of both Quicksort and the proposed OTHER algorithm. Fig. 9 depicts the effect of the cache on the performance of the FA accelerator with single-partition preprocessing for the sort operation at different cache efficiencies (table size M = 256). The relative performance improvement for lower cache efficiency could be explained by the lower number of data references per sorted key and the higher locality of data access (Table 4) . However, this advantage diminishes when large tables are used to process a small number of keys. Since OTHER algorithms use the optimal table size, cache efficiency further increases the relative performance advantage of the proposed solution (Fig. 10) .
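To illustrate how the assumed latencies translate into a cost per data reference, the sketch below computes a mean access time from two hit ratios. It assumes that a reference costs exactly 1, 4, or 16 cycles depending on where it is satisfied; the text does not state whether latencies accumulate across levels, so this is one possible reading, and the function and parameter names are hypothetical.

```python
def mean_access_cycles(l1_hit, l2_hit, t_l1=1, t_l2=4, t_mem=16):
    """Mean cycles per data reference (assumed non-cumulative latencies).

    l1_hit: fraction of references satisfied by the first-level cache.
    l2_hit: fraction of first-level misses satisfied by the second level.
    """
    if not (0.0 <= l1_hit <= 1.0 and 0.0 <= l2_hit <= 1.0):
        raise ValueError("hit ratios must lie in [0, 1]")
    miss_cost = l2_hit * t_l2 + (1.0 - l2_hit) * t_mem
    return l1_hit * t_l1 + (1.0 - l1_hit) * miss_cost
```

Under this reading, a 90 percent first-level hit ratio with half of the misses served by the second level costs 0.9 + 0.1 * (2 + 8) = 1.9 cycles per reference. Because the per-reference cost grows as hit ratios fall, an algorithm that issues fewer data references per key loses less when the cache is inefficient.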
We also examined the overall performance improvement of database operations. Our testbed included select, sort, and some typical database queries containing a mixture of database operations [33] , [34] . We chose the ideal HLA as the reference because we intended to show that our scheme achieves a similar performance improvement at the expense of fewer hardware resources. This can be attributed to the generality and effectiveness of the OTHER primitive operations. Fig. 7 shows the acceleration of sort operations. Significant acceleration of select operations could be achieved only for the ideal HLA (up to 80 percent). The proposed accelerators do not achieve significant performance improvement due to the simplicity and regularity of the select operation in the software implementation (the FA accelerates the select operation by just 8 percent when N > 2,000). Fig. 11 presents the overall acceleration of database queries for a typical mixture of database operations [33] , [35] . It should be noted that our simulation analysis also shows that, for a select on multiple attributes, the OTHER accelerator offers superior performance over the HLA.
CONCLUSION
Efficient handling of large databases motivated our research. We intended to develop a simple and cost efficient accelerator for a set of primitive and general nonnumeric operations. As a result, the so-called OTHER primitive operations were introduced. In addition, we demonstrated how complex database functions could be mapped into the OTHER primitives. Analytical and simulation studies were reported to discuss the effectiveness of our approach. We have shown that a comparable performance to an ideal HLA can be achieved by using a very cost efficient accelerator. Finally, different designs for our accelerator were introduced and analyzed.
The proposed algorithms are very efficient, even in software implementation. We have found that the optimal table size for software implementation on the DEC Alpha 500au is between N/2 and N, where N is the number of processed keys. The proposed accelerators make efficient use of small hash tables (Fig. 9) , at the expense of additional memory area for S and C vectors.
It is surprising to find that our flat accelerator (FA), although requiring three times fewer gates to implement than the hierarchical accelerator (HA), in almost all cases demonstrated a higher performance. This suggests that we can use smaller tables and, hence, partition the keys into smaller units, along with incorporating the FA accelerator design.
Veljko Milutinovic (M'81-SM'85) received the PhD degree from the University of Belgrade, Serbia, Yugoslavia, in 1982. He has been with the Department of Computer Engineering, School of Electrical Engineering, University of Belgrade since 1990. Prior to that, he was on the faculty of Purdue University, West Lafayette, Indiana. His research interests are in computer architecture/design, as well as in system support for electronic business on the Internet. He has contributed more than 50 papers to IEEE journals on computer architecture/design and technology-aware system support for mission-critical applications. He has consulted for leading industries in the US and Europe (IBM, RCA, NCR, AT&T, Virtual, eT, Zycad, Aerospace Corporation, Electrospace Corporation, Intel, Fairchild, Honeywell, Encore, Phillips, etc.). He was involved in a number of market successful industrial efforts (designer of the first multiprocessor HF data modem in the 1970s, coarchitect of the first 200MHz RISC microprocessor in the 1980s, project leader of the first RMS system for personal computers in the 1990s). He is the author of several books and was an editor/coeditor for a number of IEEE tutorial books and conference proceedings. Dr. Milutinovic has served as a guest editor for special issues of the Proceedings of the IEEE, IEEE Transactions on Computers, Computer, and IEEE Concurrency. He has presented more than 300 invited talks around the world. He is a senior member of the IEEE.
