Abstract: In this paper, a scalable high-performance multi-match priority encoder (MPE) for information retrieval is presented. The approach uses a new design architecture that constructs large MPEs from an 8-bit priority encoder as the basic building block. Experiments with an 8-bit MPE, a 64-bit MPE, and a 2,048-bit MPE show that the achieved throughputs are 1.5 times, 1.7 times, and 1.4 times as high as those of previous works, respectively. Furthermore, a 4,096-bit MPE is fully operational in an information retrieval system and is capable of returning one match per clock cycle. At the operating frequency of 75 MHz, the processing times in the worst and best cases are 54.6 µs and 0.03 µs, respectively.
[9] C.-H. Huang, et al.: "Design of high-performance CMOS priority encoders and incrementer/decrementers using multilevel lookahead and multilevel folding techniques," IEEE J. Solid-State Circuits 37 ( 
Introduction
The process of identifying and extracting information from a modern database, referred to as information retrieval (IR), has become increasingly important due to the exponential growth of data created by both enterprises and users [1]. To accelerate IR, search data must be indexed in advance. Among the many indexing techniques, the bitmap index is well known for handling multiple-condition queries effectively because of its strong support for parallel processing [2, 3]. A bitmap index returns the query result as a bit sequence, where bit i indicates whether the data at address i satisfy all given conditions. As the query result grows to several thousand bits, checking those bits efficiently plays an essential role in IR. The multi-match priority encoder (MPE) has been widely used to obtain all matching bits in a query result [4, 5]. Briefly, an MPE comprises two primary modules, namely a priority encoder (PE) and a preprocessing circuit (PRE). The PE resolves the highest-priority match in the input data and returns its position in binary format, while the PRE masks the detected bits so that the next-priority matches are found in subsequent cycles. Because of the significance of the PE inside an MPE, several efforts toward an effective PE have been made so far.
In general, most previous studies of PE attempted either to improve the conventional architecture or to minimize the number of transistors used, thereby reducing latency, resources, and power [6, 7, 8]. These improvements were successfully applied in applications such as incrementers/decrementers [9], comparators [10], and ternary content-addressable memory [11]. However, those architectures are unlikely to support PEs whose size reaches several thousand bits, since both resources and latency increase drastically.
In this paper, a new approach to constructing a scalable high-performance PE and MPE is proposed. The method employs an 8-bit PE as the basic building block, from which a large PE is built up. Then, a PRE circuit is inserted in front of the PE to form an MPE. The performance analysis shows that our 8-bit MPE, 64-bit MPE, and 2,048-bit MPE outperform other designs in terms of achieved throughput and resource utilization. Furthermore, at the maximum setting, a 4,096-bit MPE is fully operational in a 75-MHz IR system and is capable of detecting one match per clock cycle. Accordingly, the processing times in the best and worst cases are 0.03 µs and 54.6 µs, respectively.
The remainder of this paper is organized as follows. Section 2 briefly presents the background of bitmap index, PE, and MPE. Section 3 describes the hardware architecture of PE and MPE in detail. Section 4 shows the achieved throughput and resources of MPE in comparison with other designs and proves its functionality inside an IR system. Lastly, Section 5 gives the conclusion and future work.
Background and motivation

Bitmap index (BI)
BI is a bit-level matrix that stores the indexes of all search data. Fig. 1 gives a BI example in which twelve employees are indexed by six attributes, namely Age, Gender, Office, Type, Job, and Salary. The bit at row i and column j is set to one if employee E j has attribute i. To retrieve information on all employees who satisfy the conditions 'Gender = MALE', 'Type = FULLTIME', 'Job = SALE', and 'Salary > 36,000', we carry out the following steps: (1) perform bitwise AND operations between the four corresponding rows, i.e. Gender, Type, Job, and Salary; (2) inspect the query result and find all asserted bits, i.e. E 2 , E 6 , and E 9 ; (3) obtain the information at addresses two, six, and nine. It should be noted that BI is well suited to hardware implementation, since its bit-level operations map directly onto hardware parallelism. The overall throughput of an IR system is therefore likely to increase substantially.
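The three query steps can be sketched in software as follows. This is a minimal illustration, not the hardware design: the row contents are made-up data chosen so that the result matches the E 2 , E 6 , E 9 outcome quoted in the text, and the names `row_from_records`, `query`, and `asserted_positions` are hypothetical.

```python
NUM_RECORDS = 12  # twelve employees, E0 .. E11

def row_from_records(record_ids):
    """Build one bitmap row: bit j is set if record j has the attribute."""
    row = 0
    for j in record_ids:
        row |= 1 << j
    return row

# Illustrative memberships only; they are NOT taken from Fig. 1.
bitmap_index = {
    "Gender=MALE":   row_from_records([0, 2, 3, 6, 9, 11]),
    "Type=FULLTIME": row_from_records([1, 2, 5, 6, 9, 10]),
    "Job=SALE":      row_from_records([2, 4, 6, 7, 9]),
    "Salary>36000":  row_from_records([2, 3, 6, 8, 9, 11]),
}

def query(index, conditions):
    """Step (1): bitwise AND of the rows selected by the conditions."""
    result = (1 << NUM_RECORDS) - 1  # start with all records selected
    for cond in conditions:
        result &= index[cond]
    return result

def asserted_positions(bits):
    """Step (2): list the addresses of all asserted bits, lowest first."""
    return [j for j in range(NUM_RECORDS) if (bits >> j) & 1]

result = query(bitmap_index, list(bitmap_index))
matches = asserted_positions(result)
print(matches)  # addresses to fetch in step (3)
```

Step (2) is the part that the MPE performs in hardware; the list comprehension here scans every bit, whereas the MPE visits only the asserted ones.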
Priority encoder (PE) and multi-match priority encoder (MPE)
PE is used to resolve the highest-priority match and output the matching location in binary format, from which the corresponding data in memory are retrieved. Hence, most previous works considered PE as the combination of two primary modules, namely a PRIORITIZER (PRI) and an ENCODER (ENC). Fig. 2(a) depicts an 8-bit PE (PE8) with the input D of 0100110. Since D 1 is the highest-priority bit, the outputs EP and Q become 01000000 and 001, respectively.
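A behavioral sketch of the PRI/ENC split is given below. It assumes that D 0 is written first and that a lower index means higher priority, which is consistent with the EP and Q values quoted above; the 8-bit input used here is an illustrative value, and the function names are hypothetical.

```python
def prioritizer(d):
    """PRI: keep only the highest-priority asserted bit (lowest index wins)."""
    ep = [0] * len(d)
    for i, bit in enumerate(d):
        if bit:
            ep[i] = 1
            break
    return ep

def encoder(ep):
    """ENC: convert a one-hot vector to its index (0 if no bit is set)."""
    for i, bit in enumerate(ep):
        if bit:
            return i
    return 0

def pe8(d):
    """8-bit priority encoder modeled as PRI followed by ENC."""
    if len(d) != 8:
        raise ValueError("pe8 expects exactly 8 bits")
    ep = prioritizer(d)
    return ep, encoder(ep)

d = [0, 1, 0, 0, 1, 1, 0, 0]  # D0 .. D7, illustrative input
ep, q = pe8(d)
print("".join(map(str, ep)), format(q, "03b"))
```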
To handle larger input data, several PE8s are joined. This is exemplified by two architectures of 64-bit PE (PE64) shown in Fig. 2(b) and Fig. 2(c) . The output EP and Q are obtained by a set of 8-bit PRIs and 64-bit ENC respectively.
• Fig. 2(b) shows a cascade architecture of PE64. This PE64 consumes eight times the resources of a PE8, and in the worst case, its latency is eight times that of a PE8.
• Fig. 2(c) shows a parallel architecture of PE64. PRI 0 to PRI 7 return their priority matches in parallel owing to the IE signals provided by PRI 8 . Despite the improvement in latency, i.e. the latency of PE64 is only twice that of a PE8, the resource utilization increases because of the additional PRI 8 and logic gates.
MPE is an improvement of PE that outputs all matching bits, from the highest priority to the lowest. Most prior approaches inserted PRE in front of PE. In the beginning, the input data are transferred from PRE to PE for encoding. Subsequently, the matching results are sent back to PRE. If a match is located, PRE sets the corresponding bit to zero so that, in the next cycles, PE delivers the next matching positions. This process repeats until all matches have been found.
Motivation
Previous designs are therefore ill-suited to large PEs in terms of either latency or resources. For instance, with the cascade architecture, a 4,096-bit PE (PE4K) requires 512 PE8s, and in the worst case, its latency is 512 times that of a PE8. To break these barriers, a new methodology for constructing a large PE is presented.
This approach considers a 64-bit input as an 8-row × 8-column array, as illustrated in Fig. 3. The position $i$ of a bit $D_i$ is defined in (1), where $x$ and $y$ are the row and column indexes, respectively. Because the number of bits in each column is a power of two, i.e. $8 = 2^3$, the multiplier and adder in (1) are replaced by two hardware-friendly operations, SHIFT and bitwise OR, as described in (2). For example, for $i = 10$, (2) gives $x = 1$ and $y = 2$, i.e. $10 = (1 \ll 3) \ \mathrm{OR} \ 2$.

$$i = x \times 8 + y \tag{1}$$

$$i = (x \ll 3) \ \mathrm{OR} \ y \tag{2}$$
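The equivalence between the multiply-add form and the shift-OR form of the position can be checked exhaustively for the 8 × 8 case. The sketch below is only a sanity check; `position` and `row_col` are hypothetical helper names.

```python
def position(x, y):
    """Shift-OR form: valid because y < 8 never overlaps the shifted row bits."""
    return (x << 3) | y

def row_col(i):
    """Inverse mapping: upper three bits give the row, lower three the column."""
    return i >> 3, i & 0b111

# The shift-OR form agrees with x * 8 + y for every cell of the 8 x 8 array.
agrees = all(position(x, y) == x * 8 + y for x in range(8) for y in range(8))
print(agrees, row_col(10), position(1, 2))
```

The OR can replace the addition only because the column index occupies fewer bits than the shift amount; with a non-power-of-two row length a real multiplier and adder would be needed.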
With this approach, only two PE8s are required to obtain the highest-priority position of a PE64, i.e. the first PE8 finds the row index while the second PE8 finds the column index. More significantly, a scalable structure for a large PE can be developed in the same way by using the results of PE64s. Finally, MPE is formed by combining PRE with PE.
Implementation
Priority encoder circuit
Fig. 4 describes the truth table and the optimized Boolean expression of a PE8. This circuit employs only basic operations, i.e. AND, OR, and NOT, which are processed very quickly by hardware logic units. Fig. 5(a) illustrates the architecture of a PE64 formed by two PE8s, namely PE8_0 and PE8_1. These PE8s are connected in series and are responsible for determining the row and the column, respectively. Firstly, the 64-bit input data D are split into eight groups, all of which are fed into eight 8-bit OR gates and the 8-to-1 multiplexer (MUX8) simultaneously. Taking D[15:8] as an example, DOR[1] becomes one if D[15:8] contains at least one asserted bit and zero otherwise. Secondly, PE8_0 outputs the row index MR8 to MUX8 so as to select the right group. Thirdly, PE8_1 receives DMUX and produces the column index MC8. Lastly, Q64, the position of the highest-priority bit, is computed by shifting MR8 left by three bits and then performing a bitwise OR with MC8. Meanwhile, M64 indicates whether any match occurs, based on DOR.
[...] MUX64 so as to retrieve the corresponding column index MC64. As with Q64, the 12-bit matching position Q4K is obtained from MR64 and MC64. Simultaneously, M4K is asserted if any match is found.
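The PE64 data path built from two PE8s can be modeled behaviorally as follows. This is a sketch under stated assumptions rather than the authors' Verilog: bit i of the Python integer stands for D[i], a lower index means higher priority, and the signal names DOR, MR8, DMUX, MC8, Q64, and M64 follow the text while the function names are hypothetical.

```python
def pe8(bits):
    """Return (match, index) of the lowest-index asserted bit among 8 bits."""
    for i in range(8):
        if (bits >> i) & 1:
            return 1, i
    return 0, 0

def pe64(d):
    """Behavioral model of PE64 formed by two PE8s, an OR stage, and MUX8."""
    # Split the 64-bit input into eight 8-bit groups (rows of the 8 x 8 array).
    groups = [(d >> (8 * r)) & 0xFF for r in range(8)]
    # Eight 8-bit OR gates: DOR[r] is 1 if group r holds any asserted bit.
    dor = 0
    for r, group in enumerate(groups):
        if group != 0:
            dor |= 1 << r
    m64, mr8 = pe8(dor)      # PE8_0 resolves the row index MR8
    dmux = groups[mr8]       # MUX8 selects the group addressed by MR8
    _, mc8 = pe8(dmux)       # PE8_1 resolves the column index MC8
    q64 = (mr8 << 3) | mc8   # shift-OR composition of the position
    return m64, q64

print(pe64(1 << 10))               # single bit at position 10
print(pe64((1 << 63) | (1 << 17))) # two bits; the lower index wins
print(pe64(0))                     # no match
```

The same two-level pattern is what makes the structure scalable: replacing the 8-bit parts by 64-bit parts and the shift amount of three by six gives a 12-bit position for 4,096 inputs.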
Although this work focuses only on PE64 and PE4K, it should be noted that the proposed scalable design structure is applicable to many other PEs, such as a 32-bit PE (PE32) and a 2,048-bit PE (PE2K). For example, a PE32 is formed by one PE8 and one 4-bit PE using an architecture similar to that shown in Fig. 5(a). Likewise, a PE2K is built from 32 PE64s and one PE32 using the same architecture as shown in Fig. 5(b).
Multi-match priority encoder circuit
Fig. 6(a) depicts an MPE4K that is built from a PE4K and a PRE circuit containing a set of multiplexers and registers (REGs). In the beginning, EN is set to one and the input data E are stored in REG. In the subsequent cycles, D, the output of REG, is sent to PE4K. Upon receiving the matching position Q4K, the demultiplexer (DEMUX) converts this value into a 4,096-bit clear signal CLR. If CLR[i] is equal to one, REG[i] is set to zero so that PE4K looks for the next-priority bit in the following cycles. This procedure repeats until REG contains only zeros.
Fig. 6(b) shows the simulation waveform of MPE4K for an input E that contains five matching bits. At the first clock, EN is asserted, and D captures the value of E. During the next five clocks, M4K becomes one and each matching position, i.e. {1, 2, 4, 6, and 4,095}, is returned in turn. Afterwards, M4K changes to zero, which means that all matches have been captured.
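The load, encode, and clear loop of the MPE can be sketched in software as below. This is a cycle-approximate model under assumptions, not the RTL: bit i of the Python integer stands for REG[i], a lower index means higher priority, and `priority_encode` is a behavioral stand-in for PE4K rather than the hierarchical circuit. Each loop iteration corresponds to one clock cycle that returns one match.

```python
def priority_encode(reg):
    """Stand-in for PE: (match flag, position) of the lowest-index asserted bit."""
    if reg == 0:
        return 0, 0
    lowest = reg & -reg            # isolate the lowest asserted bit
    return 1, lowest.bit_length() - 1

def mpe(e, width=4096):
    """Return all matching positions of E in priority order."""
    reg = e & ((1 << width) - 1)   # EN = 1: REG captures the input E
    positions = []
    while True:
        m, q = priority_encode(reg)
        if not m:                  # M goes to zero: all matches captured
            break
        positions.append(q)
        clr = 1 << q               # DEMUX: position -> one-hot CLR
        reg &= ~clr                # PRE clears the detected bit in REG
    return positions

# Input with five matching bits at the positions quoted for Fig. 6(b).
e = sum(1 << p for p in (1, 2, 4, 6, 4095))
print(mpe(e))
```

Because one bit is cleared per iteration, the number of iterations equals the number of asserted bits, which is why the processing time depends on the match count rather than on the input width alone.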
Experimental results
PE and MPE are designed in Verilog HDL, simulated with ModelSim, and validated on several Altera FPGA devices. The achieved throughput is used to evaluate the performance of both designs. Moreover, MPE is shown to be fully operational in an IR system.
Performance analysis
The formula of achieved throughput (THR) is described in (3) , where N is the PE size and FREQ is the operating frequency.
$$\mathrm{THR\ (Gbps)} = \frac{N\ (\mathrm{bits}) \times \mathrm{FREQ\ (MHz)}}{1{,}000} \tag{3}$$

Table I shows the comparison between our designs and three previous ones, (A), (B), and (C). Since (A), (B), and (C) were verified in an Altera Stratix EP1S10F780C6 and a Cyclone IV EP4CE115F29C7, we synthesized our 8-bit MPE (MPE8), 64-bit MPE (MPE64), and 2,048-bit MPE (MPE2K) in the same FPGA devices so as to draw a fair comparison. The numbers of logic elements (LEs) and FREQ are estimated by Altera Quartus II and the TimeQuest Timing Analyzer, respectively. As seen in Table I, our designs considerably surpass the others in terms of THR: MPE8, MPE64, and MPE2K produce THR of 1.5 times (2.9/1.9), 1.7 times (6.1/3.5), and 1.4 times (145.4/102.4) as high as that of (A), (B), and (C), respectively. Regarding resource utilization, although (A) and (B) did not include the ENC circuit, they still consume as many LEs as MPE8 and MPE64, respectively. Additionally, (C) requires 4.4 times as many LEs as MPE2K, which is likely to cause severe problems of resource utilization and power consumption as N keeps increasing.
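The throughput formula in (3) is simple enough to evaluate directly. The sketch below applies it to the 4,096-bit, 87-MHz operating point reported for MPE4K in this paper; the function name `throughput_gbps` is hypothetical.

```python
def throughput_gbps(n_bits, freq_mhz):
    """THR (Gbps) = N (bits) x FREQ (MHz) / 1,000."""
    return n_bits * freq_mhz / 1000.0

thr_4k = throughput_gbps(4096, 87)
print(round(thr_4k, 1))  # Gbps, to one decimal place
```

Note that this metric counts every input bit examined per cycle, so it grows linearly with N at a fixed frequency.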
Information retrieval system (IRS)
MPE is integrated into an IRS, which is implemented on an Altera Arria V FPGA development kit [12]. The IRS comprises an ARM system, a data analytics (DA) hardware accelerator, and a multi-port memory controller, as shown in Fig. 7. Because of the given requirements, DA is configured to operate at 75 MHz with a data width of 256 bits to attain the peak bandwidth utilization of 19.2 Gbps. In the beginning, 4,096 records and 1,024 keys are sent from a host server to the DDR3 SDRAM over gigabit Ethernet. Upon completion, the ARM system requests DA to start. The bitmap index creator indexes all records by the given keys and stores the result in an internal 4,096 × 1,024-bit memory. Then, the query processor executes all given queries at a rate of one query per clock cycle. MPE checks the final query result to obtain all matches and stores them in a dual-width FIFO. This FIFO accumulates the 16-bit matching positions and outputs 256-bit data to the memory controller. Finally, the ARM system returns those results to the host server for information extraction. More details of the bitmap index creator and the query processor can be found in [3] and [13], respectively.
The timing analysis results indicate that MPE4K can operate at 87 MHz and therefore deliver THR of 356.4 Gbps. Hence, in a 75-MHz IRS, MPE4K is capable of returning one match per clock cycle. In the worst case, i.e. all bits are ones, the processing time is 4,096 × (1/75 MHz) = 54.6 µs. In the best case, i.e. all bits are zeros, MPE requires only two clocks to complete, which is equivalent to a processing time of 0.03 µs. The information retrieved at the host server demonstrates the reliability and performance of MPE.
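The timing and bandwidth figures quoted above follow from the clock frequency alone, as the short calculation below illustrates. It uses the cycle counts stated in the text (4,096 cycles in the worst case, two in the best case); the helper name `processing_time_us` is hypothetical.

```python
FREQ_MHZ = 75       # operating frequency of the IRS
DATA_WIDTH = 256    # DA data width in bits

def processing_time_us(cycles, freq_mhz=FREQ_MHZ):
    """Cycles divided by MHz gives microseconds."""
    return cycles / freq_mhz

worst_us = processing_time_us(4096)  # every bit asserted, one match per cycle
best_us = processing_time_us(2)      # no bit asserted, two clocks to finish
peak_gbps = DATA_WIDTH * FREQ_MHZ / 1000.0

print(round(worst_us, 1), round(best_us, 2), peak_gbps)
```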
Conclusion
In this brief, an efficient architecture for a scalable high-performance MPE was proposed. With this approach, an MPE8, MPE64, and MPE2K attain THR of 1.5 times, 1.7 times, and 1.4 times as high as that of previous designs, respectively. Furthermore, when applied in an IRS, an MPE4K is capable of detecting all matching positions within 54.6 µs at the operating frequency of 75 MHz. Future work will evaluate the feasibility of implementing MPE4K in a CMOS process with respect to power consumption and processing latency. The achieved results will be utilized for designing a full data analytics processor.
