Abstract--The arrival of the multimedia and information security era means that ever longer integer data must be processed. Previous sorting research has focused on the raw performance of sorting large numbers of fixed-length (fixed digit/bit) values. This paper discusses how to sort integers of arbitrary length effectively by HW/SW co-design under the Area × Time^2 (AT^2) price-performance constraint. The work proposes a multi-level (two-level) sort architecture to attain this objective: an existing fixed-digit (k-bit) hardware sorter implements the first, or basic, level of sorting, and a software radix-2^k sort implements the second, or higher, level. Through Super Radix Sorting HW/SW co-design and reuse techniques, the work makes fixed-digit HW sorters more flexible and useful.
I. INTRODUCTION
Sorting is one of the most important problems in computer science. Many fundamental processes in computing and communication systems require sorting of data. Sorting networks play a key role in the areas of parallel computing, multi-access memories and multiprocessing [3], [4], [5], [6], [11], [13], [14], [19].
Compare-and-swap elements are vital for sorting, as depicted in Fig. 1. However, if very long integers must be sorted and a hardware sorter of matching width is designed directly, the comparators and networks become very large. The circuit schematics of 1-, 2-, 4-, and 16-bit magnitude comparators are depicted in Fig. 2. The buses grow as well: in a design for 32-digit integers every bus carries 32 bit lines, and in a design for 64-digit integers every bus carries 64 bit lines, which doubles the wiring and its area.
More importantly, the circuit cost/complexity of a (2k)-bit comparator is more than twice that of a k-bit comparator, as shown in Table 1. In addition, the fan-out of CMOS circuits is limited, so additional buffers must be added to the comparator circuits.
Table 1. The circuit cost/complexity of a long bit/digit comparator is much higher than that of a short one.
Some sorter chip designs, listed in Table 2, have hardware-expandable properties [1], [2], [8], but they are not adequate for sorting integers of arbitrary length. The time performance of a fixed-digit (k-bit) hardware sorter is often better than that of a software sort program for the same digit length, as displayed in Table 3. However, a pure hardware sorter still has a higher area cost and some restrictions, so it is not yet common on commercial CPUs.
Based on these physical considerations, the author focuses on effectively solving the arbitrary-length integer sorting problem by HW/SW co-design under the Area-Time^2 (AT^2) cost-performance trade-off constraint [20], [21]. Several AT^2-optimal sorting networks under different word-length models have been proposed in [7], [9], [15], and [17].
For embedded systems, a uniprocessor software solution is often not applicable due to the insufficient I/O and performance, while realizing multiprocessor sorting methods on parallel computers is much too expensive with respect to area cost and power consumption.
As data processing migrates from 32-bit to 64-bit, 128-bit, or even wider words, a fixed-digit pure HW sorter alone cannot satisfy demand. All of the sorting algorithms and circuits in this paper are based on commonly known algorithms and structures. Nevertheless, making an existing hardware sorter reusable [12], making a pure HW sorter more flexible, and balancing its cost and performance are valuable and necessary. This paper is organized as follows. Section 2 briefly introduces the basic LSD radix sort algorithm. A cost-benefit balanced multi-level (two-level) HW/SW mixed sort architecture is then presented and discussed in Section 3. Finally, Section 4 concludes with the major findings and outlines future work.
II. STRAIGHT RADIX SORT ALGORITHM
This approach begins with the least significant key, and is known as LSD (Least Significant Digit) sort. Following the sort on a key, the piles are put together to obtain a single pile, which is then sorted on the next more significant key. This process continues until the pile has been sorted on the most significant key [13], at which point the sorted sequence is obtained.
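The pile-and-collect procedure described above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the function name `lsd_radix_sort` and the parameters `num_digits` and `radix` are names chosen here, and the keys are assumed to be non-negative integers.

```python
from collections import deque


def lsd_radix_sort(a, num_digits, radix=10):
    """Sort non-negative integers that have at most `num_digits` digits.

    Illustrative sketch: one queue (pile) per digit value, processed
    from the least significant digit to the most significant one.
    """
    aux = deque(a)                              # all elements start in AUX
    queues = [deque() for _ in range(radix)]    # one pile per digit value
    divisor = 1
    for _ in range(num_digits):                 # least significant digit first
        while aux:
            x = aux.popleft()
            queues[(x // divisor) % radix].append(x)   # distribute by digit
        for q in queues:                        # put the piles back together
            aux.extend(q)
            q.clear()
        divisor *= radix
    return list(aux)
```

Because each pile is a FIFO queue, elements with equal digits keep the order produced by the previous pass, which is what makes the digit-by-digit approach correct.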
Complexity: As shown in Fig. 3, it takes n steps to put all the elements in queue AUX, and d steps to initialize the queues Q. Figure 3 also displays an LSD radix-10 sorting example using linked allocation [9]. When the radix is very large, however, linked-list allocation becomes inefficient.
Table 3. Area-time bounds for the sorting problem on finite, fixed bit/digit numbers [21].
III. A MULTI-LEVEL MIXED ARCHITECTURE: SUPER RADIX SORT
The example shows the benefits of LSD radix sort directly: (1) the key size can be changed easily; (2) there are no recursive function calls and hence no stack-size problem. To solve the arbitrary-length integer sorting problem under the cost-performance trade-off constraint, these LSD radix benefits are exploited to the fullest.
Because the integers are very long, a bit-field structure is needed to reduce the memory requirement and accelerate the sort. As depicted in Fig. 4, a two-level HW/SW mixed sort architecture is proposed: an existing fixed-digit (k-bit) hardware sorter implements the first (bottom) level of sorting, and a software LSD radix sort (radix 2^k) running on the CPU implements the second (higher) level. The sort operation thus appears as an instruction in the assembly code, as Fig. 5 shows.
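The division of labour between the two levels can be sketched in software. In the sketch below, `hw_sort_pass` is a hypothetical stand-in for the k-bit hardware sorter (the paper does not specify its interface), and `super_radix_sort` is a name chosen here for the software LSD loop. The stand-in is assumed to be stable, because an LSD pass must preserve the order produced by earlier passes; a sorting network such as a bitonic sorter is not stable by itself, so a real design would presumably need some tie-breaking, for example on the element index.

```python
def hw_sort_pass(keys, k):
    """Hypothetical stand-in for the k-bit hardware sorter.

    Takes a list of k-bit keys and returns the permutation that sorts
    them. Assumed stable; the real sorter's interface may differ.
    """
    assert all(0 <= key < (1 << k) for key in keys)
    return sorted(range(len(keys)), key=keys.__getitem__)


def super_radix_sort(values, m, k):
    """Sort non-negative integers of at most m bits, k bits per pass."""
    mask = (1 << k) - 1
    values = list(values)
    for shift in range(0, m, k):                      # least significant field first
        keys = [(v >> shift) & mask for v in values]  # bit-field extraction
        perm = hw_sort_pass(keys, k)                  # first level: k-bit sort
        values = [values[i] for i in perm]            # second level: apply the pass
    return values
```

Changing `k` changes only the number of passes, which mirrors how the architecture lets a sorter of any fixed width be reused for longer integers.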
The architecture can directly handle integers of up to 2^32 × k digits (if 32 is the length of a common register). If k = 16, it can handle integers of up to 2^32 × 16 digits. If the number of digits exceeds this limit, a similar multi-level mixed sort architecture can be considered. If the input sequence itself is also arbitrarily long, some special designs provide solutions [24]; alternatively, the sequence can be separated into several pieces that are sorted individually and then merged to obtain the final result.
Because the numbers are very long, comparing two numbers and then swapping them directly is very inefficient [18]. An indirect method, which records only the swapped indices and holds them in cache, is a better idea.
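One way to picture the indirect method is to permute a small array of indices while the long records stay where they are. The following sketch assumes the records are held as Python integers; the name `indirect_lsd_sort` is chosen here for illustration, and Python's built-in stable sort stands in for whatever k-bit sorting step is actually used.

```python
def indirect_lsd_sort(records, m, k):
    """Return the indices of `records` in ascending order of value.

    `records` (non-negative integers of at most m bits) is never
    modified; only the index list `order` is rearranged.
    """
    mask = (1 << k) - 1
    order = list(range(len(records)))       # the only data that moves
    for shift in range(0, m, k):            # one pass per k-bit field
        # list.sort is stable, so ties keep the order of the previous pass
        order.sort(key=lambda i: (records[i] >> shift) & mask)
    return order
```

For example, with `records = [(3 << 80) | 5, 1, 2 << 40]`, the call `indirect_lsd_sort(records, 88, 16)` returns `[1, 2, 0]`, and the sorted sequence can be read out once at the end as `[records[i] for i in order]`.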
If the system has only a common CPU, the longest number has m bits, and the sort algorithm is radix sort, the average overall running time is m × O(N). If the system also has an existing fixed-digit (k-bit) hardware sorter, the overall running time of the proposed method becomes (m / k) × Td. If the HW sorter is an N(log_2 N)^2-comparator bitonic sorter, the overall running time is (m / k) × O((log_2 N)^2). Some comparisons are shown in Table 4.
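The pass count behind the m / k factor can be made concrete with a small calculation. The sketch below is only a rough model under stated assumptions: the function names are chosen here, m / k is rounded up to whole passes, N is assumed to be a power of two, and the per-pass cost is taken as the depth of a bitonic sorting network, which for N = 2^p inputs is p(p + 1)/2 compare-exchange stages. Constant factors, Td, and data-transfer costs are ignored.

```python
def passes_needed(m, k):
    """Number of k-bit passes needed to cover an m-bit key (m / k rounded up)."""
    if m <= 0 or k <= 0:
        raise ValueError("m and k must be positive")
    return -(-m // k)                      # integer ceiling division


def bitonic_depth(n):
    """Compare-exchange stages of a bitonic sorting network on n = 2**p inputs."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    p = n.bit_length() - 1                 # p = log2(n)
    return p * (p + 1) // 2


def total_stages(m, k, n):
    """Rough stage count: passes times network depth (constants ignored)."""
    return passes_needed(m, k) * bitonic_depth(n)
```

Under these assumptions, an 88-bit key needs `passes_needed(88, 32) == 3` passes with a 32-bit sorter and `passes_needed(88, 16) == 6` passes with a 16-bit sorter.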
The proposed HW/SW mixed super radix sorting architecture allows the HW/SW partitioning ratio to be changed easily, as displayed in Fig. 6, yielding a flexible, cost-benefit balanced HW/SW mixed design. The existing fixed-digit (k-bit) hardware sorter can be any preferred design, including the designer's own.
Output: A[ ] (the array in sorted order). *)
begin
Assume that all elements are initially in an auxiliary queue AUX; (* The use of AUX is for simplicity; it can be implemented by array A *)

When 88-digit integers are processed by a software LSD radix-2^32 sort mixed with a 32-bit HW sorter, 3 steps are needed. When they are processed by a software LSD radix-65,536 (radix-2^16) sort mixed with a 16-bit HW sorter, 6 steps are needed. If the hardware sorter can easily be decomposed into several stages and pipelined, it achieves higher hardware sharing and throughput, as Fig. 8 shows.

[1] 00000110 0100000010010000 1010000000010100 0010101000000000 0001000000101000 0001010101010000
[2] 10101000 0010101010101000 0000101010001010 0100000010101011 0000000000000010 0000000000000000
[3] 00000000 0000010111110000 0000000000000000 0000000010101010 0000010101000000 0101100000001100
[4] 00000011 0111100000000100 1100000011110000 0001110000010010 0000000000000000 0000000000000010
[5] 00000000 0000000000000000 0000000000000000 0000010100001000 0000000000000000 0000001100110011
[6] 00000000 0000000000000000 0000000000000001 0000100010000000 0000000000000100 0010100000010000
[7] 00000000 0000000000000001 0001010000001000 0000000100101000 0000000101000000 0000000000111000
[8] 00000000 0000000000000000 0000000100100000 0000001000010100 0010001000101000 0000001010000100
[9] 00000000 0001000010101000 0000100001110000 0000100000011000 0000000000001111 0000000001100000

Step 1:
[4] 00000011 0111100000000100 1100000011110000 0001110000010010 0000000000000000 0000000000000010
[5] 00000000 0000000000000000 0000000000000000 0000010100001000 0000000000000000 0000001100110011
[2] 10101000 0010101010101000 0000101010001010 0100000010101011 0000000000000010 0000000000000000
[6] 00000000 0000000000000000 0000000000000001 0000100010000000 0000000000000100 0010100000010000
[9] 00000000 0001000010101000 0000100001110000 0000100000011000 0000000000001111 0000000001100000
[7] 00000000 0000000000000001 0001010000001000 0000000100101000 0000000101000000 0000000000111000
[3] 00000000 0000010111110000 0000000000000000 0000000010101010 0000010101000000 0101100000001100
[1] 00000110 0100000010010000 1010000000010100 0010101000000000 0001000000101000 0001010101010000
[8] 00000000 0000000000000000 0000000100100000 0000001000010100 0010001000101000 0000001010000100

Step 2:
[3] 00000000 0000010111110000 0000000000000000 0000000010101010 0000010101000000 0101100000001100
[5] 00000000 0000000000000000 0000000000000000 0000010100001000 0000000000000000 0000001100110011
[6] 00000000 0000000000000000 0000000000000001 0000100010000000 0000000000000100 0010100000010000
[8] 00000000 0000000000000000 0000000100100000 0000001000010100 0010001000101000 0000001010000100
[9] 00000000 0001000010101000 0000100001110000 0000100000011000 0000000000001111 0000000001100000
[2] 10101000 0010101010101000 0000101010001010 0100000010101011 0000000000000010 0000000000000000
[7] 00000000 0000000000000001 0001010000001000 0000000100101000 0000000101000000 0000000000111000
[1] 00000110 0100000010010000 1010000000010100 0010101000000000 0001000000101000 0001010101010000
[4] 00000011 0111100000000100 1100000011110000 0001110000010010 0000000000000000 0000000000000010

Step 3:
[5] 00000000 0000000000000000 0000000000000000 0000010100001000 0000000000000000 0000001100110011
[6] 00000000 0000000000000000 0000000000000001 0000100010000000 0000000000000100 0010100000010000
[8] 00000000 0000000000000000 0000000100100000 0000001000010100 0010001000101000 0000001010000100
[7] 00000000 0000000000000001 0001010000001000 0000000100101000 0000000101000000 0000000000111000
[3] 00000000 0000010111110000 0000000000000000 0000000010101010 0000010101000000 0101100000001100
[9] 00000000 0001000010101000 0000100001110000 0000100000011000 0000000000001111 0000000001100000
[4] 00000011 0111100000000100 1100000011110000 0001110000010010 0000000000000000 0000000000000010
[1] 00000110 0100000010010000 1010000000010100 0010101000000000 0001000000101000 0001010101010000
[2] 10101000 0010101010101000 0000101010001010 0100000010101011 0000000000000010 0000000000000000

Table 5. The new design achieves high hardware reuse.
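The three steps of the 88-bit example can be reproduced in software. The sketch below copies the nine input rows from the example and applies a stable sort on successive 32-bit fields, starting from the least significant one; the names `ROWS` and `lsd_steps` are chosen here, and Python's built-in stable sort stands in for the 32-bit hardware sorter.

```python
# The nine 88-bit inputs of the example, keyed by their labels [1]..[9].
ROWS = {
    1: "00000110 0100000010010000 1010000000010100 0010101000000000 0001000000101000 0001010101010000",
    2: "10101000 0010101010101000 0000101010001010 0100000010101011 0000000000000010 0000000000000000",
    3: "00000000 0000010111110000 0000000000000000 0000000010101010 0000010101000000 0101100000001100",
    4: "00000011 0111100000000100 1100000011110000 0001110000010010 0000000000000000 0000000000000010",
    5: "00000000 0000000000000000 0000000000000000 0000010100001000 0000000000000000 0000001100110011",
    6: "00000000 0000000000000000 0000000000000001 0000100010000000 0000000000000100 0010100000010000",
    7: "00000000 0000000000000001 0001010000001000 0000000100101000 0000000101000000 0000000000111000",
    8: "00000000 0000000000000000 0000000100100000 0000001000010100 0010001000101000 0000001010000100",
    9: "00000000 0001000010101000 0000100001110000 0000100000011000 0000000000001111 0000000001100000",
}


def lsd_steps(rows, total_bits=88, k=32):
    """Return the label order after each k-bit LSD pass."""
    values = {label: int(bits.replace(" ", ""), 2) for label, bits in rows.items()}
    mask = (1 << k) - 1
    order = list(rows)                       # labels in input order
    steps = []
    for shift in range(0, total_bits, k):    # least significant field first
        # sorted() is stable, standing in for the k-bit hardware sorter
        order = sorted(order, key=lambda label: (values[label] >> shift) & mask)
        steps.append(order)
    return steps
```

With the default `k=32`, `lsd_steps(ROWS)` yields three orders that match Steps 1 to 3 above, ending with `[5, 6, 8, 7, 3, 9, 4, 1, 2]`; with `k=16` the same final order is reached after six passes.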
IV. CONCLUDING REMARK
This paper discusses how to sort integers of arbitrary length effectively by HW/SW co-design under the Area × Time^2 (AT^2) price-performance constraint. The work introduces a two-level (multi-level) sort architecture that attains this objective: an existing fixed-digit (k-digit) hardware sorter implements the first level of sorting, and a software LSD radix sort (radix 2^k) implements the second level.
As Table 5 shows, through HW/SW co-design and reuse methodology, the proposed mixed super radix sorting architecture makes existing hardware sorters more flexible and useful. It is time to put a hardware sorter on a common commercial CPU or network processor.
