Abstract-The computation model on which the algorithms are developed is the reconfigurable array of processors with wider bus networks (abbreviated to RAPWBN). The main difference between the RAPWBN model and other existing reconfigurable parallel processing systems is that the bus width of each network is bounded within the range $[2, \lceil \sqrt{N} \rceil]$. This strategy not only saves silicon area on the chip and greatly increases the computational power, but also allows the execution speed of the proposed algorithms to be tuned by the bus bandwidth. To demonstrate the computational power of the RAPWBN, the channel-assignment problem is studied in this paper. For the channel-assignment problem with $N$ pairs of components, we first design an $O(T + \lceil N/w \rceil)$ time parallel algorithm using $2N$ processors with a $2N$-row by $2N$-column bus network, where the bus width of each bus network is $w$-bit for $2 \le w \le \lceil \sqrt{N} \rceil$ and $T = \lfloor \log_w N \rfloor + 1$. By tuning the bus bandwidth to the natural $\log N$-bit and the extended $N^{1/c}$-bit ($N^{1/c} > \log N$) for any constant $c \ge 1$, two more results, which run in $O(\log N / \log \log N)$ and $O(1)$ time, respectively, are also derived. Compared with the algorithms proposed by Olariu et al. [17] and Lin [14], our algorithm achieves the same time complexity while significantly reducing the number of processors to $O(N)$.
Index Terms-Channel-assignment problem, minimum coloring problem, interval graph, list ranking, integer sorting, parallel algorithm, reconfigurable array of processors with wider bus networks. 
Researchers have shown that the computation power of a single processor cannot continue to be increased without limit. Thus, while designing more powerful computation machines, it is best to also develop parallel processing machines. Because of its simplicity and regularity in architecture, the mesh-connected computer is one of the most famous parallel processing systems. With the advance of VLSI techniques, it is quite suitable to be implemented by interconnection networks [3], [6]. Unfortunately, fixed architecture and communication locality are two inherent drawbacks of the mesh-connected computer. Researchers overcome these two drawbacks by equipping it with a reconfigurable bus system.
A reconfigurable parallel processing system can be defined as a set of processors connected to a reconfigurable bus system, the configuration of which can be dynamically established at runtime. There are many varieties of this kind of machine, including the reconfigurable mesh (RMESH) [16], the polymorphic torus architecture [12], [15], the processor array with a reconfigurable bus system (PARBS) [30], and the reconfigurable array of processors [8]. For the reconfigurable bus system, some researchers use optical buses instead of electrical buses [13], [32]. In this paper, we focus on the use of electrical buses. Due to the reconfigurability of the bus system, many problems can be solved in constant time. A VLSI chip called YUPPIE (Yorktown Ultra Parallel Polymorphic Image Engine) [12], [15] has been implemented to realize these proposed models.
Usually, the more processors that are used in a system, the better the execution time of a parallel algorithm. Interestingly, the running time of an algorithm can also be improved by using a wider bus system rather than by using more processors. According to the experimental results of Li and Maresca [12], [15], increasing processor silicon area by 20 percent enables each processor to control the local switch between the processor and the buses at the instruction level. This implies that it would be more efficient to save silicon area of the chip by increasing the bus capacity rather than by increasing the processor complexity. Based on this, three improved models have been proposed. They are the reconfigurable multiple bus machine (RMBM) [28], the distributed memory bus computer (DMBC) [22], and the reconfigurable array of processors with wider bus networks (RAPWBN) [9], [11]. The main difference between the RAPWBN model and the other two models is that the bus width of each bus network of the former can be tuned within the range $[2, \lceil \sqrt{N} \rceil]$.
The channel-assignment problem is an important fundamental problem and has many practical applications in computer-aided design [18], [21], [27], [29], [31]. The problem is defined as follows: Given a two-sided printed circuit board, we assume there are horizontal lines called channels on one side and vertical lines on the other. There are $N$ pairs of components, where each pair of components is to be placed on a specific vertical line and connected by a horizontal line segment. An example for eight pairs of components is shown in Fig. 1. No two different pairs of components can overlap either vertically or horizontally, as shown in Fig. 2. If their connections do not illegally overlap, different pairs of components can share the same channel.
Thus, the channel-assignment problem is to assign these components to channels in such a way that no illegal overlapping occurs and the number of channels is minimized. Fig. 3 shows the channel assignment of the eight pairs of components referenced in Fig. 1. This problem is also related to the minimum coloring problem on interval graphs, as shown in [5], [14], [17], [33], [34]. A brief description of the minimum coloring problem on interval graphs follows.
Let N pairs of components correspond to N intervals on a real line, where both the left and the right endpoints of each interval correspond to the positions of a pair of components. Then, the minimum coloring problem is to assign a color to each interval such that the overlapping intervals have distinct colors. For example, Fig. 4 shows the relationship between intervals and components corresponding to Fig. 1 .
The channel-assignment problem (or the minimum coloring problem on interval graphs) has been studied extensively by many researchers. Gupta et al. [5] gave an $O(N \log N)$ time sequential algorithm to solve this problem. The complexity of their algorithm can be reduced to $O(N)$ if the intervals are sorted. Clearly, both sorted and unsorted cases are cost optimal. Dekel and Sahni [1] gave an $O(\log N)$ time parallel algorithm for this problem on the EREW (exclusive read exclusive write) PRAM (parallel random access machine) model using $O(N^2)$ processors. Using a different approach, Savage and Wloka [23], Yu et al. [33], and Yu and Yang [34] each gave an $O(\log N)$ time parallel algorithm for this problem on the EREW PRAM model using $O(N)$ processors. The number of processors used by these algorithms can be reduced to $O(N / \log N)$ if the intervals are sorted. Clearly, both sorted and unsorted cases are cost optimal. On the reconfigurable mesh, Olariu et al. [17] gave an $O(1)$ time parallel algorithm for this problem using $N^2$ processors. Lin [14] also gave an $O(1)$ time parallel algorithm for this problem on the processor arrays with reconfigurable bus systems using $O(N^2)$ processors. Note that both architectures require $O(N^2)$ local switches and a bus width of $\log N$ bits.
In this paper, we are interested in using the reconfigurable array of processors with wider bus networks to develop parallel algorithms for the channel-assignment problem. Instead of using the direct method proposed by Olariu et al. [17] and Lin [14], we reduce the channel-assignment problem to the list ranking problem. In this representation, the connected components that belong to the same linked list are assigned to the same channel. The parallel computation model upon which our algorithms are based is the reconfigurable array of processors with wider bus networks [9], [11]. For completeness, we review some basic operations, including the prefix sums algorithm proposed by Kao et al. [8] and the list ranking algorithm proposed by Kao and Horng [9], and then derive an $O(1)$ time integer sorting algorithm using fewer processors than previously known [28]. Suppose the bus width is $N^{1/c}$-bit ($N^{1/c} > \log N$) for any constant $c \ge 1$. Based on these operations, a constant time optimal algorithm for the channel-assignment problem using $2N$ processors with a $2N$-row by $2N$-column bus network is developed. Although the bus width in the proposed algorithms is restricted to $N^{1/c}$ bits ($N^{1/c} > \log N$) for any constant $c$, in practice it can be assumed to be a constant multiple of $\log N$ bits, that is, $O(\log N)$. Under such assumptions, the parallel architecture on which our algorithms are developed uses the same local switch complexity ($O(N^2)$) and bus complexity ($O(\log N)$) as those used by Olariu et al. [17] and Lin [14]. Compared with the algorithms for these two similar architectures [17], [14], our algorithm attains the same time complexity and uses considerably fewer processors ($O(N)$). If we consider only the time and processor complexities, then the algorithm is optimal. This makes sense considering that a switch uses much less hardware than a processor does. This assumption has been used in the literature [7].
Of course, if we include switch complexity as well, our algorithms are still not optimal.
The rest of this paper is organized as follows: We first discuss the reconfigurable array of processors with wider bus networks upon which our algorithms are based in Section 2. Section 3 reviews some basic operations and derives an integer sorting algorithm. Section 4 derives the parallel algorithm for the channel-assignment problem. Finally, some concluding remarks are included in the last section.
THE COMPUTATION MODEL
In this section, we discuss the architecture of the reconfigurable array of processors with wider bus networks (RAPWBN), which is the computation model adopted in this paper. For the sake of convenience, we follow the notation proposed in the literature [9], [11].
A linear RAPWBN of size $N$ contains $N$ processors embedded in an $M$-row by $N$-column bus network. Each processor is identified by a unique index denoted $P_j$, $0 \le j < N$, and the bus width of each bus network is usually assumed to be $N^{1/c}$-bit, where $N$ is the number of processors and $c \ge 1$ is a constant. For convenience, we assume $N^{1/c} = w$, where $w$ is an integer. The $M$-row by $N$-column bus network has $2MN$ ports denoted by $-S_{i,j}$, $+S_{i,j}$, and each port has $w$-bit bus connection switches denoted by $-S_{i,j}(k)$, $+S_{i,j}(k)$ for $0 \le i < M$, $0 \le j < N$, and $0 \le k < w$. The $i$th-row bus, $0 \le i < M$, connects the $j$th-column port switch $+S_{i,j}$ to the $(j+1)$th-column port switch $-S_{i,j+1}$ for $0 \le j < N - 1$. Each processor $P_j$ also has a column bus with $M$ ports located at the local switches and denoted by $\#S_{i,j}$, and each port has $w$-bit bus connection switches denoted by $\#S_{i,j}(k)$ for $0 \le i < M$ and $0 \le k < w$. The $w$-bit column bus of a processor can be connected to any row bus by setting the port connection switches $\#S_{i,j}(k)$ to $-S_{i,j}(k)$ and/or $+S_{i,j}(k)$ for $0 \le i < M$, $0 \le j < N$, and $0 \le k < w$. Note that, for the hardware implementation, the $M$ ports located at the local switches can be controlled by a port within a processor with a $\log M$-bit address for port identification.
Any configuration of the bus system is derivable by properly establishing the local connection among the data bus of each port within each processor. To represent the local connection within each processor, we use the notation $\{g_0\}, \{g_1\}, \ldots, \{g_t\}$, where $g_i$, $0 \le i \le t - 1$, denotes a group of buses that are connected together. In Fig. 5a, we show an example where we set the port connections to $\{\#S_{i,0}(k), -S_{i,0}(k), +S_{i,0}(k), 0 \le i \le 2, 0 \le k \le 1\}$ for a linear RAPWBN of size 4 with a 3-row by 4-column bus network, where the bus width of each bus network is 2 bits.
Like the mesh-connected computer, the RAPWBN is also scalable. Suppose an RAPWBN of size $N$ with an $M$-row by $N$-column bus network is fabricated as two chips: one for the $N$ processors, named PC, and the other for the $M$-row by $N$-column bus network, named BNC. Based on the fabricated chips, an RAPWBN of size $2N$ with a $2M$-row by $2N$-column bus network can be constructed straightforwardly; that is, the $2M$-row by $2N$-column bus network can be constructed from BNCs in the same way as the mesh-connected computer, and two PCs are then connected on top of the constructed BNCs. See Fig. 5b for an example.
Fig. 6 depicts six interesting switch configurations derivable from a processor of an RAPWBN. For simplicity, if each bit of the $i$th-row bus is connected to the $j$th-column bus one by one, we use the abbreviated representation $\{-S_{i,j}, +S_{i,j}, \#S_{i,j}\}$ instead of the defined representation $\{-S_{i,j}(k), +S_{i,j}(k), \#S_{i,j}(k), 0 \le k \le w - 1\}$.
Note that, to justify the hardware cost of a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network, a practical measurement of the VLSI chip of the reconfigurable mesh, called the YUPPIE (Yorktown Ultra Parallel Polymorphic Image Engine) and implemented by Maresca and Li [12], [15], can be used. The YUPPIE VLSI chip was fabricated with 2 $\mu$m CMOS technology. It contained 16 processors arranged in a $4 \times 4$ mesh, each with the switch function.
Each processor was equipped with a one-bit ALU, five registers, and a 256-bit memory. The final chip size was $5.0 \times 6.5$ mm (excluding I/O pads), and the chip had 68 pins. The chip area was divided into three main blocks: 24 percent for the memory block, 38 percent for the processor block, and 12 percent for the switch function and mesh wires. According to the design, the switch function and mesh wires took roughly one fifth of the silicon area of the processor and the 256-bit memory (i.e., 12:62). This ratio indicates that the hardware cost of the switch function and mesh wires was fairly low. Based on this experimental result, the hardware cost of the switches and wires of a 2D $N \times N$ reconfigurable mesh can be roughly estimated as $0.12 N^2 \log^2 N$, and that of a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network as roughly $0.12 N^2 N^{2/c}$ for $N^{1/c}$-bit bus width. For example, assume that there are $2^{32}$ processors, each holding a 32-bit datum, and let $c = 6$. Then
$$N^{1/c} = (2^{32})^{1/6} \approx 40.32,$$
and the bus width $w$ is extended to 41-bit for the RAPWBN. For the switches and wires, the hardware cost ratio between $N^{1/c}$-bit (RAPWBN) and $\log N$-bit (reconfigurable mesh) for the above example would be
$$\frac{0.12 N^2 N^{2/c}}{0.12 N^2 \log^2 N} \approx \frac{41^2}{32^2} \approx 1.64.$$
This indicates that the chip area of the switches and wires of a linear RAPWBN is roughly one and a half times that of the 2D reconfigurable mesh. On the other hand, the hardware cost of the processors of a 2D $N \times N$ reconfigurable mesh can be similarly estimated as $0.62 N^2$, and that of a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network as $0.62 N$ for $N^{1/c}$-bit bus width. For the processors, the hardware cost ratio between RAPWBN and reconfigurable mesh for the above example would be
$$\frac{0.62 N}{0.62 N^2} = \frac{1}{N} = 2^{-32}.$$
This indicates that the chip area of the processors of a linear RAPWBN is smaller by a factor of $2^{32}$ than that of the 2D reconfigurable mesh. Hence, the wider the bus installed in the system, the more powerful the resulting parallel processing system. As stated before, the area occupied by the extended bus relative to the newly created chip is limited. Therefore, it would be more efficient to save silicon area of the chip by increasing the bus capacity rather than by increasing the processor complexity.
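The arithmetic behind this comparison can be checked with a short script. The following is a minimal sketch rather than part of the original analysis: the function name `cost_ratios` is hypothetical, $\log N$ is taken as base 2, and the switch-and-wire cost is assumed to grow with the square of the bus width, which is how the estimates quoted above are read here.

```python
import math

def cost_ratios(log_n=32, c=6):
    """Return (N^(1/c), bus width w, switch cost ratio, processor cost ratio).

    Assumption: switch/wire cost is proportional to (bus width)^2 and
    processor cost is proportional to the number of processors.
    """
    n = 2 ** log_n
    w_exact = n ** (1.0 / c)          # N^(1/c), about 40.32 for N = 2^32, c = 6
    w = math.ceil(w_exact)            # bus width rounded up to whole bits
    switch_ratio = (w ** 2) / (log_n ** 2)   # RAPWBN vs. reconfigurable mesh
    processor_ratio = 1.0 / n         # 0.62 N / (0.62 N^2)
    return w_exact, w, switch_ratio, processor_ratio

if __name__ == "__main__":
    print(cost_ratios())
```

Under these assumptions the switch ratio for the example is $41^2 / 32^2$, and the processor ratio is $1/N$.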
For a unit of time, we assume each processor can perform one of the following operations: execute one arithmetic or logic operation, access a local memory word of size $m$, set the local switches with the same connection configuration on the same column bus, broadcast a datum on the established bus, or receive data from the established bus. Given that there are $M$ rows, setting the $M$ local switches on the same column bus in $M'$ different configurations requires $M'$ units of time. We allow multiple processors to broadcast data on different buses or to broadcast the same data on the same bus simultaneously in a time unit if there is no collision.
An RAPWBN is operated on a SIMD (single instruction stream, multiple data streams) model. The bus width between processors is not unlimited. For transferring a $w$-bit datum between processors in constant time, we assume the bus width is bounded by $w$ bits as stated before, where $w$ is an integer. The I/O loading (download and upload) time depends entirely on how complex the I/O interface between processors and peripherals is. It is difficult to estimate accurately how much I/O time should be included in the time complexity of an algorithm. Therefore, the time complexity of each algorithm is assumed to be the sum of the maximal computation time among all processors and the communication time among all processors. This assumption was also used by many researchers [8], [9], [11], [12], [14], [15], [16], [17], [22], [28], [30].
BASIC OPERATIONS
In this section, several data operations that have been proposed on the RAPWBN are summarized. These data operations will be used to solve the channel-assignment problem in the next section. For the sake of completeness, we first review in detail the prefix sum computation proposed by Kao et al. [8] and the list ranking algorithm proposed by Kao and Horng [9]. Then, an $O(1)$ time integer sorting algorithm is derived.
The Prefix Sum Computation
Let $B = \{b_0, b_1, \ldots, b_{N-1}\}$ be a Boolean data sequence of size $N$. The prefix sum of these $N$ Boolean data, denoted as $ps_j$, is defined as follows:
$$ps_j = \sum_{i=0}^{j} b_i, \qquad (1)$$
where $b_i \in \{0, 1\}$ and $0 \le j < N$. From (1), each $ps_j$ is at most $N$. The $ps_j$ can be represented in the base-$w$ number system as follows:
$$ps_j = \sum_{h=0}^{T-1} a_{j,h} \, w^h, \qquad (2)$$
where $a_{j,h}$ is the coefficient of the $h$th digit of $ps_j$, $0 \le a_{j,h} < w$, $T = \lfloor \log_w N \rfloor + 1$, and $0 \le j < N$. The prefix sum of these $N$ Boolean data can be computed as follows: instead of computing (1) directly, we first compute the coefficients $a_{j,h}$ for each $ps_j$, and then each $ps_j$ is obtained from (2). Assume that the Boolean datum $b_j$ is initially stored in the local variable $b(j)$ of processor $P_j$, $0 \le j < N$. Finally, the prefix sum $ps_j$ is stored in the local variable $ps(j)$ of processor $P_j$, $0 \le j < N$. For the sake of completeness, the prefix sum algorithm (PSA) proposed by Kao et al. [8] is shown in the following. Assume $B = \{0, 1, 1, 1, 0, 1, 0, 1\}$ and the bus width is 3 bits, where $N = 8$ and $w = 3$. An illustration of algorithm PSA is also shown in Fig. 7.
Algorithm PSA
Input: $b(j)$, $0 \le j < N$.
Output: $ps(j)$, $0 \le j < N$.
0: begin
1: Processor $P_j$, $0 \le j < N$, copies $b(j)$ to $tp(j)$.
2: repeat Steps 2.1-2.3 from $h = 0$ to $T - 1$
2.1: // Set the switch connection. //
Processor $P_j$, $1 \le j < N$, establishes the local connection $\{\#S_{0,j}(r), -S_{0,j}(r), +S_{0,j}((r+1) \bmod w), 0 \le r < w\}$ if $tp(j) = 1$; $\{\#S_{0,j}, -S_{0,j}, +S_{0,j}\}$, otherwise. Then, processor $P_0$ establishes the local connection $\{\#S_{0,0}(r), +S_{0,0}((r+1) \bmod w), 0 \le r < w\}$ if $tp(0) = 1$; $\{\#S_{0,0}, +S_{0,0}\}$, otherwise. Then, processor $P_0$ writes a signal "1" on bit 0 of the established bus from the port $\#S_{0,0}$.
2.2: // Identify the coefficient $a_{j,h}$ of the $h$th digit of each $ps_j$. //
Processor $P_j$, $0 \le j < N$, sets $a(j)[h] = (r + tp(j)) \bmod w$ if it can read the signal "1" from bit $r$, $0 \le r < w$, of the port $\#S_{0,j}$ through the established bus, where $a(j)[h]$ denotes the $h$th memory location of the local variable $a(j)$ of processor $P_j$.
2.3: // Identify the carry. //
Processor $P_j$, $0 \le j < N$, sets $tp(j) = 1$ if $tp(j) = 1$ and $a(j)[h] = 0$; sets $tp(j) = 0$, otherwise.
3: Processor $P_j$, $0 \le j < N$, applies (2) on $a(j)[h]$, $0 \le h < T$, to obtain $ps(j)$.
4: end
Lemma 1 [8]. Given $N$ Boolean data, the prefix sum of these $N$ data can be computed in $O(T)$ time on a linear RAPWBN using $N$ processors with one row bus network, where the bus width of each bus network is $w$-bit for $2 \le w \le \lceil \sqrt{N} \rceil$ and $T = \lfloor \log_w N \rfloor + 1$.
Based on Lemma 1, algorithm PSA runs in $O(\log N / \log \log N)$ time under the natural $O(\log N)$-bit bus width. In particular, if the bus width is extended to $N^{1/c}$-bit (for $N^{1/c} > \log N$), where $c \ge 1$ is a constant, then
$$T = \lfloor \log_{N^{1/c}} N \rfloor + 1 = \lfloor c \rfloor + 1$$
is also a constant. This leads to the following two corollaries.
Corollary 1 [8]. Given $N$ Boolean data, the prefix sum of these $N$ data can be computed in $O(\log N / \log \log N)$ time on a linear RAPWBN using $N$ processors with one row bus network, where the bus width of the bus network is $\log N$-bit.
Corollary 2 [8]. Given $N$ Boolean data, the prefix sum of these $N$ data can be computed in $O(1)$ time on a linear RAPWBN of size $N$ with one row bus network, where the bus width of the bus network is $N^{1/c}$-bit ($N^{1/c} > \log N$) for any constant $c \ge 1$.
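The digit-by-digit idea of algorithm PSA can be traced with a sequential simulation. The sketch below is only an illustration under stated assumptions: the function name `prefix_sums_base_w` is hypothetical, and the left-to-right loop over $j$ stands in for the signal that, on the RAPWBN, travels along the row bus in a single step per digit.

```python
def prefix_sums_base_w(bits, w):
    """Simulate algorithm PSA: compute inclusive prefix sums digit by digit in base w."""
    n = len(bits)
    # T = floor(log_w N) + 1 digits suffice because every prefix sum is at most N.
    T, power = 1, w
    while power <= n:
        power *= w
        T += 1
    tp = list(bits)                        # Step 1: tp(j) = b(j)
    digits = [[0] * T for _ in range(n)]   # a(j)[h]
    for h in range(T):                     # Step 2
        pos = 0                            # P_0 writes the signal on bit 0
        new_tp = []
        for j in range(n):
            r = pos                        # bit on which P_j reads the signal
            a = (r + tp[j]) % w            # Step 2.2: the h-th digit of ps_j
            digits[j][h] = a
            # Step 2.3: a carry is produced where the count wraps around to 0
            new_tp.append(1 if tp[j] == 1 and a == 0 else 0)
            pos = a                        # the signal leaves P_j shifted by tp(j)
        tp = new_tp
    # Step 3: apply (2) to assemble ps_j from its base-w digits
    return [sum(d * w ** h for h, d in enumerate(digits[j])) for j in range(n)]

if __name__ == "__main__":
    print(prefix_sums_base_w([0, 1, 1, 1, 0, 1, 0, 1], 3))
```

For the example $B = \{0, 1, 1, 1, 0, 1, 0, 1\}$ with $w = 3$, the first pass yields the low digits and one carry at position 3, and the second pass counts the carries.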
The List Ranking Problem
Assume there is a linked list $a_0, a_1, \ldots, a_{N-1}$ with $a_i$ following $a_{i-1}$ in the list. The list ranking problem is to find the rank of each element of the list. The list ranking problem discussed here is the data-dependent version: only the location of the first element is given, along with a map from the $i$th element to the $(i+1)$th element. Assume element $a_i$ is contained in the $j$th processor. Then, element $a_{i+1}$ is contained in processor $next(j)$, where $next(j)$ is a link from element $a_i$. If $next(j) = nil$, then element $a_i$ is the tail of the list and its rank is assumed to be $N - 1$.
The main idea of the algorithm proposed by Kao and Horng [9] is to reduce the list ranking problem to the binary sequence prefix sum problem. It can be described by the following two steps. First, establish the local connection of the bus system on an RAPWBN according to the position of the next element. Then, compute the rank of each element in the linked list by performing the prefix sum along the established bus. Assume the head of the linked list is $a_0$. Initially, the $N$ elements of the linked list $a_0, a_1, \ldots, a_{N-1}$ are stored one per processor: if $a_i$ is stored in processor $P_j$, then $a_{i+1}$ is stored in processor $P_{next(j)}$. That is, the link of element $a_i$ is stored in the local variable $next(j)$ of processor $P_j$. Finally, the rank of element $a_i$ is stored in the local variable $rank(j)$ of processor $P_j$. Let $a_0$ be stored in processor $P_0$. A simplified version of the list ranking algorithm proposed by Kao and Horng [9] is given in the following. Assume $next(0) = 1$, $next(1) = 3$, $next(2) = nil$, $next(3) = 2$, and the bus width is 4 bits (i.e., $w = 4$). An illustration of computing the rank of each element of the linked list is shown in Fig. 8.
Algorithm LRA
Input: $next(j)$, $0 \le j < N$.
Output: $rank(j)$, $0 \le j < N$.
0: begin
1: // Establish the local connection. //
Processor $P_j$, $0 \le j < N$, establishes the local connection $\{\#S_{i,j}, -S_{i,j}, +S_{i,j}, 0 \le i < N\}$; then processor $P_j$ with $next(j) \ne nil$, $0 \le j < N$, establishes the local connection $\{\#S_{j,j}, -S_{j,j}, +S_{j,j}\}$ and $\{\#S_{next(j),j}(k), -S_{next(j),j}((k+1) \bmod N), +S_{next(j),j}((k+1) \bmod N), 0 \le k < N\}$.
2: // Compute the rank of each element. //
2.1: The processor $P_j$ holding the datum $a_0$, $0 \le j < N$, writes a signal "1" to bit 0 of the established bus on its port $\#S_{j,j}$.
2.2: Processor $P_j$, $0 \le j < N$, sets $rank(j) = r$ if it can read the signal "1" from bit $r$ of the established bus on its port $\#S_{j,j}$. (Note that this can be implemented by the BSF instruction of the Intel 80x86 family architecture.)
3: end
Lemma 2 [9]. Algorithm LRA can be performed in $O(1)$ time on a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network, where the bus width of each bus network is $N$-bit.
Like algorithm PSA, by using the base-$w$ number representation and properly establishing the local connection, algorithm LRA can be easily modified to run in $O(T)$, $O(\log N / \log \log N)$, and $O(1)$ time when the bus width is assumed to be $w$-bit, $\log N$-bit, and $N^{1/c}$-bit, respectively. This leads to the following three corollaries.
Corollary 3 [9]. The list ranking problem can be solved in $O(T)$ time on a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network, where the bus width of each bus network is $w$-bit.
Corollary 4 [9]. The list ranking problem can be solved in $O(\log N / \log \log N)$ time on a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network, where the bus width of each bus network is $\log N$-bit.
Corollary 5 [9]. The list ranking problem can be solved in $O(1)$ time on a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network, where the bus width of each bus network is $N^{1/c}$-bit ($N^{1/c} > \log N$) for any constant $c \ge 1$.
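The effect of the bus configuration in algorithm LRA can be pictured as a signal whose bit position advances by one each time it crosses a link. The following sketch traces that signal sequentially; it is an illustration only, the function name `list_rank` is hypothetical, `None` plays the role of nil, and the loop replaces what the established bus delivers in a single step.

```python
def list_rank(next_of, head=0):
    """Trace the signal of algorithm LRA: rank(j) is the bit on which P_j sees it."""
    rank = [None] * len(next_of)
    j, r = head, 0            # the head writes the signal on bit 0
    while j is not None:
        rank[j] = r           # P_j reads the signal on bit r
        j = next_of[j]        # the link shifts the signal to bit r + 1
        r += 1
    return rank

if __name__ == "__main__":
    # Example from the text: next(0) = 1, next(1) = 3, next(2) = nil, next(3) = 2
    print(list_rank([1, 3, None, 2]))
```

In the example the list visits processors 0, 1, 3, 2 in that order, so processor 2 (the tail) receives rank $N - 1 = 3$.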
The Integer Sorting Problem
Given a data sequence $A = \{a_i\}$, $0 \le i < N$, of $N$ $O(\log N)$-bit integers, the integer sorting problem is to rearrange these $N$ numbers into ascending or descending order. Trahan et al. [28] proposed an $O(1)$ time integer sorting algorithm on the $N^{1+1/c}$ RMBM with an $N$-row by $N$-column bus network, where $c \ge 1$ is any constant and the bus width of each bus network is assumed to be $\log N$-bit. The main idea of their algorithm is to reduce the integer sorting problem to the list ranking problem. It can be described by the following four steps. First, link the numbers that have the same value into one linked list, and identify the head and the tail of each linked list. Next, link these linked lists into one linked list according to the values of the heads and tails. Then, apply list ranking to rank the numbers on the linked list. Finally, copy each number to the position given by its rank. For example, assume $A = \{0, 1, 2, 2, 3, 4, 3, 0\}$. After Step 1, the links created are $next = \{nil, nil, nil, 2, nil, nil, 4, 0\}$. After Step 2, the linked list is $next = \{nil, 7, 1, 2, 3, 6, 4, 0\}$ and the head $a_0 = 5$. After Step 3, the ranks are $rank = \{0, 2, 3, 4, 5, 7, 6, 1\}$. Finally, after Step 4, the sorted sequence is $A' = \{0, 0, 1, 2, 2, 3, 3, 4\}$.
As stated by Trahan et al. [28], these four steps can be easily implemented on a linear RAPWBN using $N$ processors with an $N$-row by $N$-column bus network. Assume the bus width of each bus network is $w$-bit. The time complexity of the four steps is analyzed as follows. The first, second, and fourth steps each take $O(\lceil \log N / w \rceil)$ time for routing data to their corresponding positions. The third step takes $O(T)$ time by the list ranking algorithm stated previously. Hence, the total complexity of the algorithm is $O(T + \lceil \log N / w \rceil)$.
This leads to the following lemma.
Lemma 3. Given $N$ $O(\log N)$-bit integer numbers, these $N$ numbers can be sorted in $O(T + \lceil \log N / w \rceil)$ time on a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network, where the bus width of each bus network is $w$-bit for $2 \le w \le \lceil \sqrt{N} \rceil$ and $T = \lfloor \log_w N \rfloor + 1$.
Lemma 3 is quite efficient. Since the bus width $w$ of each bus network is bounded within the range $[2, \lceil \sqrt{N} \rceil]$, the running time of the algorithm can be tuned by increasing or reducing the bus width $w$. For example, it runs in $O(\log N / \log \log N)$ time under the natural $O(\log N)$-bit bus width. It is also quite efficient even if $w$ is less than $\log N$. Assume there are $2^{32}$ processors, each holding a 32-bit datum, and the bus width of each bus network is 16-bit. Then, only $O(11)$ time units are required to sort these $2^{32}$ data items. In particular, if the bus width is extended to $N^{1/c}$-bit (for $N^{1/c} > \log N$), where $c \ge 1$ is a constant, then
$$T + \lceil \log N / w \rceil = \lfloor \log_{N^{1/c}} N \rfloor + 1 + \lceil \log N / N^{1/c} \rceil = \lfloor c \rfloor + 2$$
is also a constant. For example, let $c = 6$ and assume the bus width of each processor is 32-bit. Then,
$$N^{1/c} = (2^{32})^{1/6} \approx 40.32.$$
That is, when the bus width $w$ is extended to 41-bit, only $O(8)$ time units are required to sort these $2^{32}$ data items. This leads to the following two corollaries.
Corollary 6. Given $N$ $O(\log N)$-bit integer numbers, these $N$ numbers can be sorted in $O(\log N / \log \log N)$ time on a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network, where the bus width of each bus network is $\log N$-bit.
Corollary 7. Given $N$ $O(\log N)$-bit integer numbers, these $N$ numbers can be sorted in $O(1)$ time on a linear RAPWBN of size $N$ with an $N$-row by $N$-column bus network, where the bus width of each bus network is $N^{1/c}$-bit ($N^{1/c} > \log N$) for any constant $c \ge 1$.
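The step counts quoted for Lemma 3 can be reproduced by evaluating $T + \lceil \log N / w \rceil$ directly. This is a small sketch under the assumption that $\log N$ is base 2 and $N = 2^{\log N}$; the function name `sorting_time_bound` is hypothetical.

```python
def sorting_time_bound(log_n, w):
    """Evaluate T + ceil(log N / w) with T = floor(log_w N) + 1 and N = 2**log_n."""
    n = 2 ** log_n
    k, power = 0, w
    while power <= n:          # k ends as the largest integer with w**k <= N
        power *= w
        k += 1
    t = k + 1                  # T = floor(log_w N) + 1
    routing = -(-log_n // w)   # integer ceiling of log N / w
    return t + routing

if __name__ == "__main__":
    print(sorting_time_bound(32, 16))
    print(sorting_time_bound(32, 41))
```

For $N = 2^{32}$ and $w = 16$ the expression evaluates to 11. For $w = 41$ a direct evaluation gives 7, which lies within the bound of 8 obtained from $c + 2$ with $c = 6$.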
Note that the derived result is better than that of Trahan et al. [28]. That is, based on the wider bus network architecture, the integer sorting problem can be solved in the same time complexity while the processor complexity is reduced from $N^{1+1/c}$ to $N$. This shows that the integer sorting algorithm can be improved efficiently by increasing the bus capacity rather than by increasing the processor complexity.
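The four-step reduction from integer sorting to list ranking can be followed on the example $A = \{0, 1, 2, 2, 3, 4, 3, 0\}$ with a sequential sketch. Everything below is illustrative: the function name `sort_by_list_ranking` is hypothetical, `None` stands for nil, the input is assumed non-empty, and the call to `sorted` over the distinct values merely stands in for the value-indexed linking that the bus network performs. Ranks are counted so that the tail receives rank 0, which is the convention the example in the text follows.

```python
def sort_by_list_ranking(a):
    """Return (next after Step 1, next after Step 2, head, rank, sorted list)."""
    n = len(a)
    nxt = [None] * n
    first, last = {}, {}
    # Step 1: link equal values; each element points to the previous equal one.
    for i, v in enumerate(a):
        nxt[i] = last.get(v)
        first.setdefault(v, i)   # tail of the list for value v
        last[v] = i              # head of the list for value v
    step1 = list(nxt)
    # Step 2: link the tail of each value's list to the head of the next smaller value.
    values = sorted(last)
    for smaller, larger in zip(values, values[1:]):
        nxt[first[larger]] = last[smaller]
    head = last[values[-1]]
    # Step 3: rank the elements along the list (tail gets rank 0).
    order = []
    j = head
    while j is not None:
        order.append(j)
        j = nxt[j]
    rank = [0] * n
    for pos, j in enumerate(order):
        rank[j] = n - 1 - pos
    # Step 4: copy each number to the position given by its rank.
    out = [None] * n
    for i in range(n):
        out[rank[i]] = a[i]
    return step1, nxt, head, rank, out

if __name__ == "__main__":
    for part in sort_by_list_ranking([0, 1, 2, 2, 3, 4, 3, 0]):
        print(part)
```

Run on the example, the intermediate `next` arrays, the head position, and the ranks agree with the values listed in the text.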
THE CHANNEL-ASSIGNMENT PROBLEM
Given $N$ pairs of components on a two-sided printed circuit board, each pair of components is to be placed on a specific vertical line and connected by a horizontal line segment. Two pairs of components can share the same channel if their connections do not illegally overlap with each other. The channel-assignment problem is to minimize the total number of channels. In our algorithm, we use the following interval graph notation to state the channel-assignment problem. Let $I = \{I_i\}$, $0 \le i < N$, be a family of $N$ intervals on a real line for an interval graph. Each interval $I_i$ is represented by $[a_i, b_i]$, where $a_i$ represents the left endpoint and $b_i$ represents the right endpoint of $I_i$. Without loss of generality, we may assume that $a_i$ is smaller than $b_i$ and that these $2N$ endpoints are all distinct integers. The minimum coloring problem of an interval graph is to assign a color to each interval such that overlapping intervals do not share the same color. To show the equivalence of the channel-assignment problem and the minimum coloring problem on an interval graph, let these $N$ pairs of components correspond to $N$ intervals on a real line, where the left and right endpoints of each interval correspond to the positions of a pair of components. Through such a transformation, the channel-assignment problem is mapped to the minimum coloring problem of an interval graph.
To solve the channel-assignment problem, Gupta et al. [5] proposed a popular sequential algorithm, whose main idea is as follows. Let the $2N$ endpoints of the $N$ intervals be sorted in ascending order. First, assume that all colors are available and a stack is created. Then, the endpoints are scanned sequentially from the smallest left endpoint to the largest right endpoint. If the current endpoint is a left endpoint, a color is assigned to its interval and the interval is pushed onto the stack; otherwise, the current endpoint is a right endpoint, so the color of its interval is released and the interval is popped from the stack. Each released color can be reused for the next interval whose left endpoint is the closest subsequent one. Ultimately, each interval is assigned its color after the $2N$ endpoints have been scanned. That is, all pairs of components are assigned to their associated channels.
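The endpoint scan described above can be sketched as follows. This is a minimal illustration, not the algorithm of Gupta et al. verbatim: the function name `color_intervals` is hypothetical, and the sketch keeps a stack of released colors, which is one simple way to realize the reuse rule. Endpoints are assumed to be distinct integers, as in the text.

```python
def color_intervals(intervals):
    """Greedy coloring of intervals [(a, b), ...] by one scan over sorted endpoints.

    Returns (color per interval, number of colors used).
    """
    events = []
    for idx, (a, b) in enumerate(intervals):
        events.append((a, 0, idx))   # 0 marks a left endpoint
        events.append((b, 1, idx))   # 1 marks a right endpoint
    events.sort()
    free = []                        # stack of released colors
    used = 0                         # number of distinct colors handed out so far
    color = [None] * len(intervals)
    for _, kind, idx in events:
        if kind == 0:
            if free:
                color[idx] = free.pop()   # reuse the most recently released color
            else:
                color[idx] = used         # no released color: open a new one
                used += 1
        else:
            free.append(color[idx])       # right endpoint: release the color
    return color, used

if __name__ == "__main__":
    print(color_intervals([(0, 3), (1, 2), (4, 5)]))
```

A new color is opened only when every existing color is held by an interval that is still open, so the number of colors equals the largest number of simultaneously open intervals.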
Based on the sequential algorithm, the proposed parallel algorithm can be described at a high level in the following four steps.
1. Sort these $2N$ endpoints in ascending order.
2. Based on the density of the intervals along each coordinate, determine the minimum number of colors, $\chi$, required to color these $N$ intervals, and number these colors from 0 to $\chi - 1$.
3. Determine, for each interval, the successive interval which can share the same color, provided that it exists. Thus, $\chi$ linked lists are created.
4. Find the head of each linked list and then broadcast its associated color number from it to the intervals along the linked list.
In the following parts of this section, we explain the detailed implementation of these four steps.
Let $c(j)$, $0 \le j < 2N$, be the sequence of left and right integer endpoints sorted in ascending order. As before, we assume that $a_i < b_i$ and that the $2N$ endpoints are distinct integers. For example, the sorted endpoint sequence of Fig. 4 is as follows:

$$\{c(0), c(1), c(2), \ldots, c(14), c(15)\}$$
Let each $c(j)$ represent not only the attribute of an interval but also its coordinate. For example, $c(14) = b_7$ represents the right endpoint of the interval $I_7$, whose coordinate is 14. Like Sprague and Kulkarni [26], we define a density sequence $d(0), d(1), \ldots, d(2N-1)$ as follows: for a fixed integer $k$, $0 \le k < 2N$, the density $d(k)$ of intervals at coordinate $k$ is the number of intervals which contain $k + \epsilon$, where $0 < \epsilon < 1$. For example, the density sequence of Fig. 4 is

$$\{d(0), d(1), d(2), \ldots, d(14), d(15)\} = \{1, 2, 1, \ldots, 1, 0\}.$$

Note that $|d(j) - d(j-1)| = 1$ for $1 \le j < 2N$.
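The density sequence $d(j)$, $0 \le j < 2N$, can be obtained by first setting $t(j)$ as follows:

$$t(j) = \begin{cases} 1 & \text{if } c(j) \text{ is a left endpoint,} \\ -1 & \text{if } c(j) \text{ is a right endpoint.} \end{cases} \tag{3}$$

Then,

$$d(j) = \sum_{i=0}^{j} t(i), \quad 0 \le j < 2N. \tag{4}$$

Let $\chi$ be the minimum number of colors which are needed to color the family of intervals $I$. Thus,

$$\chi = \max_{0 \le j < 2N} d(j). \tag{5}$$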
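As a sanity check on the definitions of $t(j)$ and $d(j)$, the following Python sketch computes the density sequence as a running sum of the $\pm 1$ weights and takes its maximum as the number of colors. The function name `density_sequence` and the four-interval input are illustrative assumptions; the input is not the instance of Fig. 4.

```python
from itertools import accumulate


def density_sequence(intervals):
    """Return (t, d, colors) for intervals (a_i, b_i) with distinct endpoints.

    t[j] is +1 for a left endpoint and -1 for a right endpoint of the
    sorted endpoint sequence; d[j] is the prefix sum of t up to j.
    """
    endpoints = sorted(
        [(a, +1) for a, _ in intervals] + [(b, -1) for _, b in intervals]
    )
    t = [weight for _, weight in endpoints]
    d = list(accumulate(t))   # d(j) = t(0) + t(1) + ... + t(j)
    return t, d, max(d)       # the maximum density is the number of colors


t, d, colors = density_sequence([(0, 5), (1, 2), (3, 4), (6, 7)])
print(t)       # [1, 1, -1, 1, -1, -1, 1, -1]
print(d)       # [1, 2, 1, 2, 1, 0, 1, 0]
print(colors)  # 2
```

Consecutive densities differ by exactly one, and the last density is zero because every interval has been closed by then.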
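Based on the density sequence $d(j)$, we define a successor sequence, denoted by $\mathit{next}(j)$, $0 \le j < 2N$, which links each interval to the nearest successive interval that can share its color. $\mathit{next}(j)$, $0 \le j < 2N$, can be formulated as follows:

$$\mathit{next}(j) = \begin{cases} k, \text{ where } c(k) = b_i & \text{if } c(j) = a_i, \\ \min\{k \mid j < k \le 2N-1,\ c(k) \text{ is a left endpoint},\ d(k) = d(j) + 1\} & \text{if } c(j) = b_i \text{ and such a } k \text{ exists,} \\ \text{nil} & \text{otherwise.} \end{cases} \tag{6}$$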
That is, if $c(j)$ is the left endpoint $a_i$ of an interval $I_i$, then $\mathit{next}(j)$ is set to the coordinate of its right endpoint $b_i$. If $c(j)$ is the right endpoint $b_i$ of an interval $I_i$, then $\mathit{next}(j)$ is set to the coordinate of the left endpoint $a_k$ of interval $I_k$, where $I_k$ is the nearest successive interval with $c(b_i) < c(a_k) \le 2N - 1$ whose density $d(a_k)$ is equal to $d(b_i) + 1$; if no such left endpoint $a_k$ exists, then $\mathit{next}(j)$ is set to nil. For example, in Fig. 4, $c(1)$ is the left endpoint $a_1$, so $\mathit{next}(1)$ is set to the coordinate 2 of the right endpoint $b_1$. Since $c(2)$ is the right endpoint $b_1$, and both $d(a_2)$ and $d(a_4)$ are equal to $d(b_1) + 1$ with $c(b_1) < c(a_2) < c(a_4)$, the nearer of the two left endpoints, $a_2$, is chosen, and $\mathit{next}(2)$ is set to the coordinate 3 of the left endpoint $a_2$ of the interval $I_2$. Since no left endpoint can be found as the successor of $b_7$, $\mathit{next}(14)$ is set to nil. Based on $d(j)$, $\chi$, and $\mathit{next}(j)$ as defined above, we have the following theorem.

Proof. By (5), there is a maximal clique $K_\chi$ in the interval graph, where $\chi$ denotes the minimum number of colors; clearly, $\chi$ colors are required to color it. By the definition of $\mathit{next}(j)$, the right endpoint of each interval $I_i$ chooses the interval after it whose left endpoint is the nearest to it and whose left-endpoint density is larger than the density of the right endpoint of $I_i$ by one. Since each interval has a pair of endpoints, its left endpoint contributes a weight of 1 and its right endpoint contributes a weight of $-1$. By (6), two right endpoints of two intervals cannot choose the same left endpoint as their successive interval because all endpoints are distinct. Hence, each interval belongs to exactly one linked list, and exactly $\chi$ linked lists are created due to $K_\chi$. $\square$
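The successor rule can be checked on a small instance with the following Python sketch. It is a sequential illustration only, using a quadratic scan in place of the bus operations of the parallel algorithm; the function name `successor_sequence`, the use of `None` for nil, and the four-interval input are assumptions made here.

```python
from itertools import accumulate


def successor_sequence(intervals):
    """Return (next, d) for intervals (a_i, b_i) with distinct endpoints."""
    # Sorted endpoint sequence: (coordinate, weight t(j), interval index).
    pts = sorted(
        [(a, +1, i) for i, (a, _) in enumerate(intervals)]
        + [(b, -1, i) for i, (_, b) in enumerate(intervals)]
    )
    t = [weight for _, weight, _ in pts]
    owner = [i for _, _, i in pts]
    d = list(accumulate(t))                 # density sequence d(j)
    m = len(pts)
    right_pos = {owner[j]: j for j in range(m) if t[j] == -1}

    nxt = [None] * m                        # None plays the role of nil
    for j in range(m):
        if t[j] == +1:
            nxt[j] = right_pos[owner[j]]    # a_i points to its own b_i
        else:
            # b_i points to the nearest later left endpoint whose density
            # exceeds d(j) by one, if there is any.
            for k in range(j + 1, m):
                if t[k] == +1 and d[k] == d[j] + 1:
                    nxt[j] = k
                    break
    return nxt, d


nxt, d = successor_sequence([(0, 5), (1, 2), (3, 4), (6, 7)])
tails = sum(1 for k in nxt if k is None)
print(nxt)             # [5, 2, 3, 4, None, 6, 7, None]
print(tails, max(d))   # 2 2
```

In this instance the number of list tails (endpoints whose successor is nil) equals the maximum density, in line with the claim that the number of linked lists matches the number of colors.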
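From the successor function $\mathit{next}(j)$, the head of each linked list, denoted by $\mathit{head}(j)$, $0 \le j < 2N$, is identified and marked by

$$\mathit{head}(j) = \begin{cases} 1 & \text{if } c(j) \text{ is a left endpoint and } \mathit{next}(k) \ne j \text{ for every right endpoint } c(k), \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$

Hence, the color number of each head can be numbered by

$$\mathit{color}(j) = \begin{cases} \left( \sum_{i=0}^{j} \mathit{head}(i) \right) - 1 & \text{if } \mathit{head}(j) = 1, \\ \text{nil} & \text{otherwise,} \end{cases} \tag{8}$$

where $0 \le j < 2N$. Based on (3)-(8) and Theorem 1, we propose an efficient parallel algorithm for the channel-assignment problem on a linear $2N$ RAPWBN with a $2N$-row by $2N$-column bus network. Initially, the left and right endpoints $a_j$ and $b_j$ of interval $I_j$ are stored in the local variables $a_j$ and $b_j$ of processor $P_j$, $0 \le j < N$, respectively. Finally, the associated channel of each interval $I_j$ is stored in the local variable $\mathit{asgn}(j)$ of processor $P_j$, $0 \le j < N$. The detailed channel-assignment algorithm (CAA) is given below. Assuming the input intervals are those shown in Fig. 1, an illustration of algorithm CAA is shown in Figs. 9a, 9b, and 9c.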
Algorithm CAA
Input: A family $I$ of intervals $\{I_i \mid 0 \le i < N\}$ of an interval graph, where $I_i = [a_i, b_i]$.
Output: The channel assignment $\mathit{asgn}(j)$, $0 \le j < N$.

0. begin
1. // Sort the left and right endpoints of the $N$ intervals into ascending order. //
   1.1: Processor $P_j$, $0 \le j < 2N$, establishes the local connections $\{-S_{i,j}, +S_{i,j};\ 0 \le i < N\}$ and $\{\sharp S_{(j \bmod N),j}, -S_{(j \bmod N),j}, +S_{(j \bmod N),j}\}$. Then, processor $P_j$, $0 \le j < N$, broadcasts $b_j$ to processor $P_{j+N-1}$ using the $j$th-row bus network.
   1.2: By Lemma 3, sort $a_j$ and $b_{j+N-1}$, $0 \le j < N$, into ascending order and store the endpoint of rank $k$ (an $a_i$ or a $b_i$) in the local variable $c(k)$ of processor $P_k$, so that $c(0) \le c(1) \le \cdots \le c(2N-1)$.
2. // Compute $d(j)$. //
   2.1: Processor $P_j$, $0 \le j < 2N$, sets $t(j)$ to 1 if it holds a left endpoint; otherwise, it sets $t(j)$ to $-1$.
   2.2: By Lemma 1, perform the prefix sum on $t(j)$ to obtain $d(j) = \sum_{i=0}^{j} t(i)$ for $0 \le j < 2N$.
3. // Compute $\mathit{next}(j)$. //
   3.1: Processor $P_j$, $0 \le j < 2N$, establishes the local connection $\{-S_{i,j}, +S_{i,j};\ 0 \le i < N\}$, and then establishes the local connection $\{\sharp S_{i,j}, -S_{i,j}, +S_{i,j}\}$ if $c(j) = a_i$ or $c(j) = b_i$, $0 \le i < N$. Then, processor $P_j$ with $c(j) = b_i$, $0 \le j < 2N$, $0 \le i < N$, broadcasts the coordinate of $b_i$ on the $i$th-row bus network from the port $\sharp S_{i,j}$.
   3.2: Processor $P_j$ with $c(j) = a_i$, $0 \le j < 2N$, $0 \le i < N$, reads the data from the port $\sharp S_{i,j}$ through the $i$th-row bus and stores it into $\mathit{next}(j)$.
   3.3: Processor $P_j$, $0 \le j < 2N$, establishes the local connection $\{-S_{i,j}, +S_{i,j};\ 0 \le i < N\}$, then establishes the local connection $\{\sharp S_{r,j}, -S_{r,j};\ r = d(j) - 1\}$ if $c(j) = a_i$, and then establishes the local connection $\{\sharp S_{r,j}, +S_{r,j};\ r = d(j)\}$ if $c(j) = b_i$. Then, processor $P_j$ with $c(j) = a_i$ and $r = d(j) - 1$, $0 \le j < 2N$, $0 \le i < N$, broadcasts the coordinate of $a_i$ on the established bus from the port $\sharp S_{r,j}$.
   3.4: Processor $P_j$ with $c(j) = b_i$ and $r = d(j)$, $0 \le j < 2N$, $0 \le i < N$, reads the data from the port $\sharp S_{r,j}$ through the $r$th-row bus and stores it into $\mathit{next}(j)$.
4. // Compute $\mathit{head}(j)$. //
   4.1: Processor $P_j$, $0 \le j < 2N$, establishes the local connections $\{-S_{i,j}, +S_{i,j};\ 0 \le i < N\}$, then establishes the local connection $\{\sharp S_{j,j}, -S_{j,j}\}$ if $c(j) = a_i$, and then establishes the local connection $\{\sharp S_{\mathit{next}(j),j}, +S_{\mathit{next}(j),j}\}$ if $c(j) = b_i$ and $\mathit{next}(j) \ne$ nil. Then, processor $P_j$ with $c(j) = b_i$ and $\mathit{next}(j) \ne$ nil, $0 \le j < 2N$, $0 \le i < N$, broadcasts a signal $*$ on the established bus from the port $\sharp S_{\mathit{next}(j),j}$.
   4.2: Processor $P_j$ with $c(j) = a_i$, $0 \le j < 2N$, $0 \le i < N$, sets $\mathit{head}(j)$ to 0 if it can read the signal $*$ from the port $\sharp S_{j,j}$ through the established bus; otherwise, it sets $\mathit{head}(j)$ to 1.
   4.3: Processor $P_j$, $0 \le j < 2N$, sets $\mathit{head}(j) = 0$ if $c(j) = b_i$, $0 \le i < N$.
5. // Compute $\mathit{color}(j)$. //
   By Lemma 1, perform the prefix sum on $\mathit{head}(j)$ to obtain $\mathit{color}(j) = \left( \sum_{i=0}^{j} \mathit{head}(i) \right) - 1$ for $0 \le j < 2N$. Then, processor $P_j$, $0 \le j < 2N$, sets $\mathit{color}(j)$ to nil if $\mathit{head}(j) = 0$.
6. // Assign the color to all intervals. //
   6.1: Processor $P_j$, $0 \le j < 2N$, establishes the local connections $\{-S_{i,j}, +S_{i,j};\ 0 \le i < 2N\}$ and $\{\sharp S_{j,j}, -S_{j,j}, +S_{j,j}\}$, and then establishes the local connection $\{\sharp S_{\mathit{next}(j),j}, -S_{\mathit{next}(j),j}, +S_{\mathit{next}(j),j}\}$ if $\mathit{next}(j) \ne$ nil. Then, processor $P_j$ with $c(j) = a_i$ and $\mathit{head}(j) = 1$, $0 \le j < 2N$, $0 \le i < N$, broadcasts $\mathit{color}(j)$ on the established bus from the port $\sharp S_{j,j}$.
   6.2: Processor $P_j$ with $c(j) = a_i$, $0 \le j < 2N$, $0 \le i < N$, reads the data from the port $\sharp S_{j,j}$ through the established bus and stores it into $\mathit{asgn}(j)$.
7. // Copy the assigned color number of each interval back to its corresponding position. //
   7.1: Processor $P_j$ establishes the local connection $\{-S_{i,j}, +S_{i,j};\ 0 \le i < N\}$ if $0 \le j < 2N$, then establishes the local connection $\{\sharp S_{j,j}, -S_{j,j}, +S_{j,j}\}$ if $N \le j < 2N$, and then establishes the local connection $\{\sharp S_{i,j}, -S_{i,j}, +S_{i,j}\}$ if $c(j) = a_i$, $0 \le i < N$, $0 \le j < N$. Then, processor $P_j$ with $c(j) = a_i$, $0 \le i < N$, $0 \le j < N$, broadcasts $\mathit{asgn}(j)$ on the $i$th-row bus from the port $\sharp S_{i,j}$.
   7.2: Processor $P_j$, $N \le j < 2N$, reads the data from the port $\sharp S_{j-N,j}$ through the $(j-N)$th-row bus and stores it into $\mathit{temp}(j)$.
   7.3: Processor $P_j$, $0 \le j < 2N$, establishes the local connections $\{-S_{i,j}, +S_{i,j};\ 0 \le i < N\}$ and $\{\sharp S_{(j \bmod N),j}, -S_{(j \bmod N),j}, +S_{(j \bmod N),j}\}$. Then, processor $P_j$, $N \le j < 2N$, broadcasts $\mathit{temp}(j)$ back to $\mathit{asgn}(j-N)$ of processor $P_{j-N}$ using the $(j-N)$th-row bus network.
   7.4: ... $0 \le i < N$, $N \le j < 2N$. Then, processor $P_j$ with $c(j) = a_i$, $0 \le i < N$, $N \le j < 2N$, broadcasts $\mathit{asgn}(j)$ on the $i$th-row bus from the port $\sharp S_{i,j}$.
   7.5: Processor $P_j$, $0 \le j < N$, reads the data from the port $\sharp S_{j,j}$ through the $j$th-row bus and stores it into $\mathit{asgn}(j)$.
8. end

By increasing the bus width of each bus network, algorithm CAA can easily be modified to run more efficiently. When the bus width of each bus network is larger than $\log N$ bits, Steps 3.1-3.4, 6.1-6.2, and 7.1-7.5 of algorithm CAA take $O(1)$ time. Hence, the total time complexity of algorithm CAA is dominated by Lemma 3 in Step 1.2 and Lemma 1 in Steps 2.2 and 5. If these two lemmas are replaced by their corresponding corollaries, then the following two corollaries hold.
Corollary 8. Given a family $I$ of $N$ intervals, the channel-assignment problem can be solved in $O(\log N / \log \log N)$ time on a linear $2N$ RAPWBN with a $2N$-row by $2N$-column bus network, where the bus width of each bus network is $\log N$ bits.

Corollary 9. Given a family $I$ of $N$ intervals, the channel-assignment problem can be solved in $O(1)$ time on a linear $2N$ RAPWBN with a $2N$-row by $2N$-column bus network, where the bus width of each bus network is $N^{1/c}$ bits ($N^{1/c} > \log N$) for any constant $c$ with $c \ge 1$.
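To make the data flow of Steps 1 through 6 of algorithm CAA concrete, the following Python sketch emulates them sequentially on one machine. It is not the parallel algorithm: bus broadcasts are replaced by ordinary loops, so it says nothing about the running times stated in the corollaries. The function name `channel_assignment`, the use of `None` for nil, and the test intervals are assumptions made for illustration.

```python
from itertools import accumulate


def channel_assignment(intervals):
    """Sequentially emulate Steps 1-6 of CAA; returns (asgn, colors)."""
    n = len(intervals)
    # Step 1: sort the 2N endpoints (coordinate, weight, interval index).
    pts = sorted(
        [(a, +1, i) for i, (a, _) in enumerate(intervals)]
        + [(b, -1, i) for i, (_, b) in enumerate(intervals)]
    )
    t = [weight for _, weight, _ in pts]
    owner = [i for _, _, i in pts]
    m = 2 * n

    # Step 2: density d(j) as the prefix sum of t(j).
    d = list(accumulate(t))

    # Step 3: successor next(j); None plays the role of nil.
    right_pos = {owner[j]: j for j in range(m) if t[j] == -1}
    nxt = [None] * m
    for j in range(m):
        if t[j] == +1:
            nxt[j] = right_pos[owner[j]]
        else:
            nxt[j] = next(
                (k for k in range(j + 1, m)
                 if t[k] == +1 and d[k] == d[j] + 1),
                None,
            )

    # Step 4: a head is a left endpoint that no right endpoint points to.
    targeted = {nxt[j] for j in range(m) if t[j] == -1 and nxt[j] is not None}
    head = [1 if t[j] == +1 and j not in targeted else 0 for j in range(m)]

    # Step 5: number the heads 0, 1, 2, ... with a prefix sum.
    prefix = list(accumulate(head))
    color = [prefix[j] - 1 if head[j] else None for j in range(m)]

    # Step 6: pass each head's color along its linked list.
    asgn = [None] * n
    for j in range(m):
        if head[j]:
            k = j
            while k is not None:
                asgn[owner[k]] = color[j]
                k = nxt[k]
    return asgn, max(d)


print(channel_assignment([(0, 5), (1, 2), (3, 4), (6, 7)]))  # ([0, 1, 1, 0], 2)
print(channel_assignment([(0, 3), (1, 4), (2, 5)]))          # ([0, 1, 2], 3)
```

In both examples the number of distinct channels equals the maximum density, and no two overlapping intervals receive the same channel.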
CONCLUDING REMARKS
The transmission time of a data item between processors is determined by the size of the data item and the bus capacity. Improving the efficiency of an algorithm simply by increasing the number of processors is not suitable because the VLSI silicon area would then increase substantially. An alternative approach is to extend the bus bandwidth of the parallel processing system. According to the results of Li and Maresca [12], [15], the silicon area of the reconfigurable array of processors with a wider bus network can be reduced substantially by increasing the bus capacity rather than the processor complexity. The goal is not only to make a spatial improvement by saving the silicon area of the chip, but also to increase the system efficiency. Based on the wider bus network architecture, algorithms to solve the channel-assignment problem have been derived in this paper. Table 1 compares our results with existing ones. In particular, compared to the algorithms proposed by Olariu et al. [17] and Lin [14], our algorithms both decrease the execution time by tuning the bus bandwidth and reduce the number of processors used in the computation. Thus, our algorithms are superior in both time and processor complexity.
Note that the complexity of the local switch configurations in the RAPWBN depends entirely on the number of row buses and the bandwidth between processors. Only six functions need to be embedded in the local switch for the algorithms proposed in this paper. One may object that the bus driving capacity is limited while a constant-time propagation delay is assumed for the bus networks. Indeed, the fan-out of a CMOS driver is limited, but this problem can be alleviated either by increasing the driving power or by using bus buffers [19]. The propagation delay depends on the VLSI technology and the interconnect material. Certainly, a linear or logarithmic propagation delay may be more realistic for metal interconnects. We agree that the connection delay depends on the problem size, so the constant-time broadcast delay may not be entirely accurate. However, we are confident that the broadcast delay is very small. With advanced VLSI technology, there are at least two projects that demonstrate the feasibility and benefits of the RAPWBN. One is the YUPPIE chip [12], [15], and the other is the GCN (gate-connection network) chip [25]. For a YUPPIE with $10^6$ processors, only 16 machine cycles are required for broadcasting. GCN further reduces the delay by using pre-charged circuits. Thus, the constant-time propagation delay assumption has been widely adopted by other researchers [8], [9], [11], [12], [14], [15], [16], [17], [22], [28], [30].
Bus networks can be easily implemented using current technologies. Metal interconnects incur a larger delay than optical fibers. Optical interconnections offer many advantages over their electronic counterparts, including high connection density, low crosstalk, and a relaxed bandwidth-distance product [2]. Currently, optical interconnection techniques are used to provide reconfigurability and simultaneous switching for massively parallel processing systems [20]. It should be mentioned that Schuster and Ben-Asher [24] have shown that the constant-time assumption can be attained if the reconfigurable bus is manufactured using optical fibers. The Jitney Optical Bus, for example, with twenty channels (500 Mb/s/ch), has been designed for high-speed parallel computing and successfully demonstrated in IBM AS/400 and RS6000 power parallel system testbeds [10]. Due to these developments in technology, computational models with wider reconfigurable optical buses are likely to become feasible architectures in the near future.
Since our algorithms are not optimal when both processors and switches are considered, there is room for improvement. Future research includes reducing the number of switches or the bus width needed to achieve constant-time solutions for the channel-assignment problem.
TABLE 1 Results Comparison for Parallel Channel-Assignment Algorithms
The PRAM model could be implemented by three technologies: crossbar switch, multiple buses, and multistage network. Assuming that $N$ processors are used, their corresponding switching complexities are $N^2$ cross points, $N$ shared buses, and $N \log N$ switch boxes, respectively [7].

Jennifer Seitzer received the PhD degree in 1997 for her work in theoretical artificial intelligence. She is in her third year as an assistant professor in the Computer Science Department at the University of Dayton, Ohio. In 1998, she was invited to work in the Networking Lab studying computer networking under Douglas Comer at Purdue University for three months. Her current research and study involves both intelligent systems and computer networking for which she has been funded by several organizations including the US National Science Foundation. Dr. Seitzer's current projects involve distributing a large knowledge-based reasoning system onto a Beowulf cluster and extending rule discovery to metapattern discovery with an emphasis on cycle mining.
