Abstract: BSR (Broadcasting with Selective Reduction) is a PRAM more powerful than any CRCW PRAM. In order to extend the Broadcast Instruction of BSR and make it more useful for a large class of applications, this article permits it to use a general form of selection, specifically, an arbitrary relational expression. BSR with general selection is denoted by BSR$^{+}$. Thus, BSR, or BSR with $k$ criteria ($k > 1$), is a special case of BSR$^{+}$. An efficient implementation for the Broadcast Instruction of BSR$^{+}$ is proposed, requiring $1/k$th of the circuits used by the best previous implementation of BSR with $k$ criteria. Of all PRAMs, BSR$^{+}$ is the most powerful in computation.
It is well known that the Parallel Random Access Machine (PRAM) is the most popular model of parallel computation. Of its three most commonly used variants, the Concurrent-Read Concurrent-Write (CRCW) PRAM is the most powerful. The PRAM is also referred to in the literature as a Shared Memory SIMD (Single-Instruction stream, Multiple-Data stream) Computer. At the implementation level, the PRAM requires a Memory Access Unit (MAU), sometimes also known as an Interconnection Unit (IU), to connect its processors to its shared memory locations [2], [22]. The IU of a CRCW PRAM can supply all the data paths between processors and memory locations needed by a Concurrent-Read (CR) or Concurrent-Write (CW) instruction and, for a CW instruction, the IU can also compute some function to resolve the write conflict, such as selection by priority (for PRIORITY CW), comparison (for COMMON CW), or another function (sum, product, etc., for COMBINING CW). To bring the efficiency of the IU into full play, the Broadcasting with Selective Reduction (BSR) model of computation was proposed in [4]. BSR is a PRAM extension that possesses all the features of the CRCW PRAM with one additional instruction, namely the BROADCAST Instruction (BI), and requires no more resources asymptotically than the EREW PRAM [23]. With the BI, constant time solutions have been obtained on BSR for many problems, such as sorting, prefix computation, parenthesis matching, and tree problems [2], [4], [6], [32]. To obtain optimal solutions for more applications, it is necessary to extend single selection in the BI of BSR to double or multiple selection; the problem of efficiently implementing double or multiple selection was mentioned as an open problem in [23].
In [7], [8], instead of single selection (one criterion), selection using $k$ terms ($k$ criteria, $k > 1$) was permitted in the BI of BSR, to which we refer as BSR$^{k}$ in this article, and a direct implementation based on that in [4] was given for BSR$^{k}$. Another direct implementation based on that in [23] was recently proposed in [18]. Thus, BSR$^{k}$ allows optimal solutions to more problems, such as the ANSV problem [35], the GPC problem [33], and many problems in computational geometry [7], [8]. However, there are still many other problems that need other combinations of the selection criteria, such as sorting vertices of a multidimensional space and querying and reporting on a database [27], [34]. Therefore, the BI should, in general, allow an arbitrary relational expression for selection, as originally proposed in [2] (see problem 11.41). In this article, such a general form of selection is permitted in the BI of BSR and the resulting model is denoted by BSR$^{+}$. As a model of parallel computation, BSR$^{+}$ should satisfy two requirements, namely, simplicity and implementability [21]. Simplicity allows us to describe parallel algorithms easily and to analyze mathematically important performance measures such as speed and memory utilization. Implementability guarantees that the parallel algorithms developed for the model are meaningful and easily implementable at the level of hardware (parallel computers). Obviously, a model without an implementation is of little (if any) value. There are, however, two levels of implementation: logical and physical. The logical implementation involves the structure and functions of logical circuits and is typically used in research, while the physical implementation is concerned with actual electronic circuits (including such issues as the length of wires) and belongs naturally in industry.
This article focuses on the logical implementation of BSR$^{+}$ and offers some applications to illustrate the simplicity of describing parallel algorithms with BSR$^{+}$. Compared to the best previous implementation of BSR$^{k}$, the implementation of BSR$^{+}$ is efficient, as it requires $1/k$th of the circuits used by the implementation of BSR$^{k}$. In Section 2, BROADCAST Instructions and some symbols are defined. In Section 3, some background is given for the implementation of BSR$^{+}$. In Sections 4 and 5, an efficient implementation for the BI of BSR$^{+}$ is described for $n \le \phi$ (where $n$ is the size of a problem and $\phi$ is the word length of a processor) and $n > \phi$, respectively. A comparison of PRAMs is given in Section 6. Section 7 describes some applications. We conclude in Section 8 and present an open problem for future research.
BROADCAST INSTRUCTIONS AND NOTATION
The BSR model of parallel computation is a PRAM augmented with an additional form of concurrent access to shared memory, namely, the BROADCAST Instruction (BI). For ER, EW, CR, or CW of a PRAM with $n$ processors, it is known that
1. In each unit of time, each processor is allowed to execute an instruction or to stay idle [21], i.e., in each unit of time, there are $n'$ ($n' \le n$) processors active.
2. In each unit of time, $n'$ active processors access $m$ ($m \le n'$) memory locations simultaneously.
3. In each unit of time, an active processor can access only one memory location.
As mentioned in Section 1, to bring the efficiency of the IU into full play, i.e., to achieve as much parallelism as possible, BSR satisfies the following conditions:
1. In each unit of time, each processor is allowed to execute an instruction or to stay idle, i.e., in each unit of time, there are $n'$ ($n' \le n$) processors active.
All three phases are performed simultaneously for all active processors $i$, $1 \le i \le n'$, and all memory locations $u_j$, $1 \le j \le m$, as illustrated in Fig. 1. Formally, a BI of BSR is written as follows:
When the ranges of $i$ and $j$ in the above BI are understood, the BI can be abbreviated as $u_j := \mathcal{R}\, d_i \mid t_i\, \sigma\, l_j$.
Fig. 1. Three phases of a BROADCAST instruction.

If the criterion $t_i\, \sigma\, l_j$ in the BI of BSR is replaced with $k$ criteria, i.e.,
is used as the selection condition of $d_i$ for $u_j$, then the BI of BSR$^{k}$ is obtained, where
It is shown in [23] that BSR with single selection is more powerful than any CRCW PRAM and yet its logical implementation requires no more resources asymptotically than that of the EREW PRAM. Note also that BSR is BSR$^{k}$ with $k = 1$ and that BSR$^{k}$ is a special case of BSR$^{+}$. Consequently, BSR$^{k}$ is more powerful than BSR, and BSR$^{+}$ is the most powerful in computation. In any direct implementation for the BI of BSR$^{k}$, a host of buses, registers, and combinational circuits are needed since all data in the $k$ criteria of the BI must be processed at once. For the same reason, any direct implementation for the BI of BSR$^{+}$ will be much more complicated. Therefore, an indirect implementation for the BI of BSR$^{+}$ is the more promising approach.
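The semantics of a single-criterion BI can be illustrated by a sequential simulation. The following is only a minimal sketch, not the hardware implementation discussed in this article: the function name `broadcast_instruction` and its parameters are hypothetical, and the rule that a location receiving no datum keeps its old value is an assumption of the sketch.

```python
import operator
from functools import reduce

def broadcast_instruction(tags, data, limits, sigma, reduce_op, memory):
    """Sequentially simulate one single-criterion BI: u_j := R d_i | t_i sigma l_j."""
    result = list(memory)
    for j, limit in enumerate(limits):
        # Selection: location j keeps the data whose tag satisfies the criterion.
        selected = [d for t, d in zip(tags, data) if sigma(t, limit)]
        # Reduction: combine the selected data; an empty selection leaves
        # the location unchanged (an assumption of this sketch).
        if selected:
            result[j] = reduce(reduce_op, selected)
    return result

# Prefix sums with one BI: processor i broadcasts (tag i, datum a[i]);
# location j accepts every datum whose tag is <= j and adds them up.
a = [3, 1, 4, 1, 5]
idx = list(range(len(a)))
prefix = broadcast_instruction(idx, a, idx, operator.le, operator.add, [0] * len(a))
print(prefix)  # [3, 4, 8, 9, 14]
```

On BSR all selections and reductions happen simultaneously, so the two nested loops of the simulation correspond to a single instruction.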
BACKGROUND FOR THE IMPLEMENTATION OF BSR$^{+}$
There are three published implementations of BSR. They include i) using memory buses [4], ii) using a mesh of trees [6], and iii) using circuits for sorting and prefix computation [23]. In each of the above implementations, the entire circuit has depth $O(\log n)$. Implementations i) and ii) have size $O(n^2)$, and implementation iii) has size $O(n \log n)$, which is optimal [23]. There are also three published implementations of BSR$^{k}$ corresponding to the above three implementations of BSR, i.e., i) using memory buses, ii) using a mesh of trees [7], [8], and iii) using circuits for sorting and prefix computation [18]. The depth of the entire circuit is still $O(\log n)$ for each of these implementations of BSR$^{k}$, but the sizes are $O(kn^2)$, $O(kn^2)$, and $O(kn^2 \log n)$, respectively. Why does BSR have an $O(n \log n)$ implementation, while no $O(n \log n)$ implementation has been found for BSR$^{k}$? We believe two reasons account for this situation:
1. In the set $\{<, \le, =, \ge, >, \ne\}$ of selection operations, "$<$," "$\le$," "$>$," and "$\ge$" are all linear order comparisons. Selection operation "$=$" can be regarded as "$\le$" except "$<$," and "$\ne$" can be regarded as "all" except "$=$." Thus, for the BI of BSR, $u_j := \mathcal{R}\, d_i \mid t_i\, \sigma\, l_j$, by sorting $(t_i, d_i)$ on $t_i$ for all $i$ and $l_j$ for all $j$ and merging them into one ordered linear array, all data $d_i$ selected by $l_j$ for $u_j$ can be obtained directly or indirectly (for "$=$" and "$\ne$") from an interval in the ordered linear array, for all $j$. Therefore, BSR has an $O(n \log n)$ implementation [23].
2. For the BI of BSR$^{k}$, when $k > 1$, the selection space is multidimensional and we do not know how to find, on tags and limits, an ordered linear array in which an interval can supply all data needed by $u_j$, for all $j$. Therefore, an optimal implementation for BSR$^{k}$ has not been found. Based on the BSR implementation iii), only an $O(kn^2 \log n)$ implementation was obtained for BSR$^{k}$ [18].
For the same reason, we have not found an $O(n \log n)$ (or $O(kn \log n)$) implementation for BSR$^{+}$ based on the optimal one for BSR. Among the three existing implementations of BSR$^{k}$, we choose the mesh of trees as the basis for the implementation of BSR$^{+}$. In other words, the implementation of BSR$^{+}$ in this article can be regarded as a simplified BSR$^{k}$ implementation ii), whose size is $1/k$th that of BSR$^{k}$, but whose function is an extension of BSR$^{k}$.
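The interval argument in reason 1 can be made concrete for the criterion "$<$" with sum as the reduction. The following is a minimal sequential sketch under stated assumptions, not the circuit of [23]: sorting and prefix sums stand in for the sorting and prefix computation circuits, and the function name `bsr_sum_less_than` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def bsr_sum_less_than(tags, data, limits):
    """For each limit l_j, sum the data d_i whose tag satisfies t_i < l_j."""
    pairs = sorted(zip(tags, data))              # sort (t_i, d_i) on t_i
    sorted_tags = [t for t, _ in pairs]
    prefix = [0] + list(accumulate(d for _, d in pairs))
    # All tags below l_j occupy the interval [0, pos) of the sorted array,
    # so one prefix-sum lookup yields the reduced value for location j.
    return [prefix[bisect_left(sorted_tags, l)] for l in limits]

print(bsr_sum_less_than([5, 2, 9, 2], [10, 20, 30, 40], [1, 3, 6, 10]))
# [0, 60, 70, 100]
```

With a single criterion the selected data always form one interval of the sorted order; with several criteria no such single ordering is known, which is the obstacle described in reason 2.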
The architecture and the IU of BSR$^{+}$ are shown in Fig. 2a and 2b, respectively, where CAT$_B$ and CAT$_R$ mean Concurrent Access Tree for Broadcasting and Concurrent Access Tree for Reduction [8], respectively. The memory space of the system is distributed among $n$ processors. As a basis for our design, the Selection & Reduction Circuit of BSR$^{k}$ (implementation ii)) is given in Fig. 3. Obviously, this is a direct implementation for the $k$ criteria: for each processor $j$, datum $d_j$ and its $k$ tags must be broadcast at once, so the size of the entire circuit is $O(kn^2)$. By using a New Comparator, BSR$^{+}$ implements its general selection indirectly, and the Selection & Reduction Circuit of BSR$^{+}$ is given in Fig. 4. Obviously, it is the same as that of BSR (implementation ii)) except for the Comparator. From the IU of BSR$^{+}$ in Fig. 4b, we know that the size of the entire circuit is $n \times$ (size of CAT$_B$ + size of Selection Unit + size of CAT$_R$). Since a CAT$_B$ or CAT$_R$ is of size $O(n)$ and a Selection Unit (composed of $n$ Comparators) is also of size $O(n)$, the size of the entire circuit is $O(n^2)$ for BSR$^{+}$. Therefore, in the following two sections, we will focus on how to implement the New Comparator efficiently.
NEW COMPARATOR (I): IMPLEMENTATION FOR THE BI OF BSR$^{+}$ WHEN $n \le \phi$
When $n \le \phi$, where $n$ and $\phi$ are the size of a problem (in general, $n$ is also the number of processors) and the word length of a processor, respectively, the BI of BSR$^{+}$ can be implemented efficiently by adding a new (generalized) compare operation to the set $\{<, \le, =, \ge, >, \ne\}$ and using a bit-correspondence technique [35] to program a procedure.
A New Compare Operation $o^{*}$ and Its Implementation
A new (generalized) compare operation, to be added to the set $\{<, \le, =, \ge, >, \ne\}$, is denoted by $o^{*}$ and defined as follows:
$A\ o^{*}\ B$ if and only if $A \wedge B \ne 0$, where $\wedge$ is the bitwise AND operation.
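The defining test of the new compare operation, a nonzero bitwise AND of two words, can be written in one line. The following is a minimal sketch with Python integers standing in for processor words; the function name `new_compare` is hypothetical.

```python
def new_compare(a: int, b: int) -> bool:
    """True iff the words a and b share at least one 1 bit (a AND b != 0)."""
    return (a & b) != 0

print(new_compare(0b0101, 0b0100))  # True: bit 2 is set in both words
print(new_compare(0b0101, 0b1010))  # False: no common 1 bit
```

Unlike the six order comparisons, this test treats its operands as bit vectors rather than as numbers, which is what allows one bit per processor to carry the outcome of a selection.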
Bit-Correspondence Technique
In what follows, logic operations are applied bitwise. Thus, $\wedge$ is bitwise AND, $\vee$ is bitwise OR, and $\sim$ is bitwise NOT.
Proof. The BI of BSR$^{+}$ is
Since k is a given constant, the BI above can be implemented in constant time by the following procedure: 
This completes the proof. $\square$
NEW COMPARATOR (II): IMPLEMENTATION FOR THE BI OF BSR$^{+}$ WHEN $n > \phi$
In most applications, $\phi$ is a constant and $n > \phi$. In this section, the implementation for the BI of BSR$^{+}$ is based on that in Section 4, but is independent of $\phi$. The bit-correspondence technique is embodied in hardware.
Two Registers and Six Special Instructions
The following changes need to be made to the architecture: We need to add two registers with $n$ bits to each processor and six special instructions to the instruction repertoire, where $n$ is the number of processors. The two registers of processor $j$ are called $A_j$ and $B_j$, respectively. The $i$th bit of $A_j$ ($B_j$) corresponds to processor $i$, for $1 \le i, j \le n$. The six special instructions are described as follows:
I0: $x_j := \mathcal{R}\, d_i$: The datum broadcast by processor $i$ (i.e., $d_i$) is selected for reduction $\mathcal{R}$ when the $i$th bit of $A_j$ is 1, for $1 \le i \le n$;
I1: $A_j := t_i\, \sigma\, l_j$: The $i$th bit of $A_j$ is assigned the value 1 when $t_i\, \sigma\, l_j$ is true, or the value 0 when $t_i\, \sigma\, l_j$ is false, for $1 \le i \le n$;
I2: $B_j := t_i\, \sigma\, l_j$: The $i$th bit of $B_j$ is assigned the value 1 when $t_i\, \sigma\, l_j$ is true, or the value 0 when $t_i\, \sigma\, l_j$ is false, for $1 \le i \le n$;
I3: $A'_j := t_i\, \sigma\, l_j$: When $t_i\, \sigma\, l_j$ is true, the $i$th bit of $A_j$ is not changed; otherwise, when $t_i\, \sigma\, l_j$ is false, the $i$th bit of $A_j$ is assigned the value 0, for $1 \le i \le n$;
I4: $B'_j := t_i\, \sigma\, l_j$: When $t_i\, \sigma\, l_j$ is true, the $i$th bit of $B_j$ is not changed; otherwise, when $t_i\, \sigma\, l_j$ is false, the $i$th bit of $B_j$ is assigned the value 0, for $1 \le i \le n$;
I5: $A_j$: $A_j$ is assigned the result of the bitwise OR of the current values of $A_j$ and $B_j$.
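The effect of the six special instructions on the two registers of one processor $j$ can be simulated in software. The following is a minimal sketch under stated assumptions, not the circuit of Fig. 5: the class `SelectionRegisters`, its method names, and the way a relational expression is split into AND groups joined by OR are illustrative choices of the sketch.

```python
import operator
from functools import reduce

class SelectionRegisters:
    """The n-bit registers A_j and B_j of one processor j, held as Python ints."""

    def __init__(self):
        self.a = 0
        self.b = 0

    @staticmethod
    def _mask(tags, limit, sigma):
        # Bit i is 1 exactly when t_i sigma l_j holds.
        return sum(1 << i for i, t in enumerate(tags) if sigma(t, limit))

    def i1(self, tags, limit, sigma):   # set every bit of A_j from the test
        self.a = self._mask(tags, limit, sigma)

    def i2(self, tags, limit, sigma):   # set every bit of B_j from the test
        self.b = self._mask(tags, limit, sigma)

    def i3(self, tags, limit, sigma):   # clear bits of A_j where the test fails
        self.a &= self._mask(tags, limit, sigma)

    def i4(self, tags, limit, sigma):   # clear bits of B_j where the test fails
        self.b &= self._mask(tags, limit, sigma)

    def i5(self):                       # A_j := A_j OR B_j
        self.a |= self.b

    def i0(self, data, reduce_op, default=None):
        # Reduce the data whose bit in A_j is 1.
        chosen = [d for i, d in enumerate(data) if (self.a >> i) & 1]
        return reduce(reduce_op, chosen) if chosen else default

# Select d_i when (x_i < 5 AND y_i < 5) OR (x_i > 8), then sum the selection.
xs, ys, data = [1, 7, 9, 3], [2, 1, 0, 8], [10, 20, 30, 40]
regs = SelectionRegisters()
regs.i1(xs, 5, operator.lt)
regs.i3(ys, 5, operator.lt)
regs.i2(xs, 8, operator.gt)
regs.i5()
result = regs.i0(data, operator.add)
print(result)  # 40
```

Each instruction handles one term of the expression, so an expression with a constant number of terms costs a constant number of instructions while only one tag per processor is broadcast at a time.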
Implementation of the Two Registers and the Six Special Instructions
Let NBI be the decoding signal of Normal Broadcast 
Thus, the New Comparator (II) is obtained, as shown in Fig. 5b. Note that a D-trigger can be composed of no more than six gates, so the additional cost is no more than $15 + 2 \times 6 = 27$ gates.
Since the old Comparator must be composed of no fewer than $6 \times \phi$ gates, when $\phi > 4$, the cost of the New Comparator cannot be more than twice that of the old Comparator.
Theorem 5. With the New Comparator (II), the BI of BSR$^{+}$ can be implemented efficiently when $n > \phi$.
Proof. The BI of BSR$^{+}$ is
COMPARISON OF PRAMS
For two models $A$ and $B$ of parallel computation, the statement that model $A$ is more powerful than model $B$ means that with the same number of processors,
• all problems can be solved on $A$ in no more time than on $B$, and
• some problems can be solved on $A$ in less time than on $B$.
Thus, CREW is more powerful than EREW, CRCW is more powerful than CREW, and BSR$^{+}$ is the most powerful of the PRAMs.
Since the relation between processors and memory locations is a key in a PRAM model (many processors have access to a single shared memory unit [21], or to a shared memory with multiple modules [22]), it is reasonable that the relation be used as an object for the comparison of PRAMs.
Let us consider the set $\mathcal{P}$ of $n$ processors $P_i$ ($1 \le i \le n$) and the set $\mathcal{M}$ of $m$ memory locations $M_j$ ($1 \le j \le m$) in a PRAM and assume $n = m$ for ease of presentation. The $n$ memory locations are in the Shared Memory, such as that in Fig. 6 (i.e., Fig. 1.3 of [21], where $p = n$), or in the $n$ modules of the Shared Memory with Multiple Modules, respectively, such as that in Fig. 7 (i.e., Fig. 1.2 of [22], where $m = n$).
In a PRAM, each processor can access each memory location and each memory location can be accessed by each processor. Therefore, there is a two-way data path between each processor and each memory location for reading and writing. These paths can be illustrated as in Fig. 8. If at any time, for reading or writing, the state of passable paths is a one-one (partial) correspondence between $\mathcal{P}$ and $\mathcal{M}$, then the PRAM is the EREW PRAM, as shown, for example, in Fig. 9.
If at any time, for reading, the state of passable paths is a one-many (partial) correspondence from $\mathcal{M}$ to $\mathcal{P}$ under which one element in $\mathcal{P}$ corresponds to one element in $\mathcal{M}$ at most and, for writing, the state of passable paths is a many-one (partial) correspondence from $\mathcal{P}$ to $\mathcal{M}$ under which one element in $\mathcal{P}$ corresponds to one element in $\mathcal{M}$ at most, then the PRAM is the CRCW PRAM, as shown, for example, in Fig. 10.
Similarly, if at any time, for reading, the state of passable paths is a one-many (partial) correspondence from $\mathcal{M}$ to $\mathcal{P}$ under which one element in $\mathcal{P}$ corresponds to one element in $\mathcal{M}$ at most and, for writing, the state of passable paths is a one-one (partial) correspondence from $\mathcal{P}$ to $\mathcal{M}$, then the PRAM is the CREW PRAM.
If at any time, for reading, the state of passable paths is a one-one (partial) correspondence from $\mathcal{M}$ to $\mathcal{P}$ and, for writing, the state of passable paths is a many-many full correspondence from a subset of $\mathcal{P}$ to a subset of $\mathcal{M}$, then the PRAM is the BSR PRAM, as shown, for example, in Fig. 11.
Reading is moving data from shared memory locations to registers (or local memory locations) of processors, and writing is the reverse. The relation between processors and memory locations for a CR or CW instruction is many-one, while the relation for a BI is many-many. Many-many is symmetric; therefore, reading and writing of a BI can also be understood as follows: if, at any time, for reading, the state of passable paths is a many-many full correspondence from a subset of $\mathcal{M}$ to a subset of $\mathcal{P}$ and, for writing, the state of passable paths is a one-one (partial) correspondence from $\mathcal{P}$ to $\mathcal{M}$, then the PRAM is the BSR PRAM, as shown, for example, in Fig. 12.
In general,
1. The sum total of all the states of passable paths in EREW, CREW, CRCW, or BSR is a cover on the many-many full correspondence between $\mathcal{P}$ and $\mathcal{M}$ (Fig. 8).
2. BSR includes CRCW as a special case, CRCW includes CREW, and CREW includes EREW.
Point 1 above implies that the cost of implementation for EREW, CREW, CRCW, or BSR may be asymptotically the same. We note here that the time of memory access in EREW, CREW, or CRCW is typically assumed to be $O(1)$ for simplicity when, in fact, $O(\log n)$ circuit stages are needed for a processor to be able to access each of $n$ memory locations. Therefore, in BSR (BSR$^{k}$, or BSR$^{+}$), if the number of circuit stages is $O(\log n)$ between $\mathcal{P}$ and $\mathcal{M}$, then the time of memory access is assumed to be $O(1)$ for the same reason. We now illustrate point 2. Consider the following computation:
Three arrays $A, B, C[1..n]$ are given as input, where $A$ and $C$ are functions on $\{1, 2, \ldots, n\}$. It is required to produce an array $D[1..n]$ as output, according to the following computation:
BEGIN
  for $j := 1$ to $n$ dopar
    $x(j) := B(A(j))$;
    $D(C(j)) := x(j)$;
  pardo
END.
($x(j) := B(A(j))$ and $D(C(j)) := x(j)$ can be replaced by $D(C(j)) := B(A(j))$.)
There are four cases to consider:
• $A$ and $C$ are injections $\Rightarrow$ the solution can be executed on EREW;
• $A$ is a noninjection and $C$ is an injection $\Rightarrow$ the solution can be executed on CREW;
• ($A$ is an injection and $C$ is a noninjection $\Rightarrow$ the solution can be executed on ERCW;)
• $A$ and $C$ are noninjections $\Rightarrow$ the solution can be executed on CRCW.
No matter what functions $A$ and $C$ are, the solution on BSR is as follows:
BEGIN
  for $j := 1$ to $n$ dopar
    $x(j) := \sum B(i) \mid i = A(j)$;
    $D(j) := \mathcal{R}\, x(i) \mid C(i) = j$;
  pardo
END.
In other words, the Broadcast Instruction of BSR can include functions of ER, EW, CR, and CW.
On the other hand, a single BI can give rise to a cover on the many-many full correspondence between $\mathcal{P}$ and $\mathcal{M}$, while $n$ ER (EW, CR, or CW) instructions are needed to engender a cover on the many-many full correspondence. Thus, for some problems, solutions may be found on BSR that are faster than those possible on CRCW. For example, sorting can be done on BSR in $O(1)$ time [2], [32], while $O(\log n)$ time is needed on CREW or CRCW for the same problem (see Corollary 4.5 of [21]). Therefore, BSR is strictly more powerful than CRCW.
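The BSR solution to the computation of $D$ from $A$, $B$, and $C$ can be simulated sequentially to see how the two BIs cover all four cases. The following is a minimal sketch under stated assumptions: the function name `bsr_gather_scatter` is hypothetical, the reduction for the second BI defaults to maximum as one possible way to combine conflicting writes, and a location that selects no datum is reported as `None`.

```python
def bsr_gather_scatter(a, b, c, reduce_op=max):
    """Simulate x(j) = B(A(j)) and D(C(j)) = x(j) with two BIs (1-based values in a and c)."""
    n = len(b)
    # BI 1: x(j) := SUM B(i) | i = A(j); exactly one i matches each j,
    # so the sum returns B(A(j)) whether or not A is an injection.
    x = [sum(b[i] for i in range(n) if i + 1 == a[j]) for j in range(n)]
    # BI 2: D(j) := R x(i) | C(i) = j; several i may match when C is a noninjection.
    d = []
    for j in range(n):
        chosen = [x[i] for i in range(n) if c[i] == j + 1]
        d.append(reduce_op(chosen) if chosen else None)
    return x, d

x, d = bsr_gather_scatter(a=[2, 2, 3], b=[10, 20, 30], c=[1, 3, 3])
print(x)  # [20, 20, 30]
print(d)  # [20, None, 30]
```

Here $A$ reads location 2 twice (a concurrent read) and $C$ writes location 3 twice (a concurrent write), yet the same two instructions are used as in the injective case.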
APPLICATIONS
When describing algorithms on BSR$^{+}$, the BI of BSR$^{+}$ can be used directly (without the need for the special instructions). Since BSR or BSR$^{k}$ is a special case of BSR$^{+}$, all the existing algorithms on BSR or BSR$^{k}$, such as [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [23], [24], [25], [26], [28], [29], [30], [31], [32], [35], [36], [37], are also solutions on BSR$^{+}$ for those problems. With general selection, BSR$^{+}$ allows constant time solutions to more applications. To show the power of BSR$^{+}$, we give BSR$^{+}$ constant time solutions to the following problems:
1. Sorting vertices of a multidimensional space lexicographically, 2. Three counting problems for a relational database.
Sorting Vertices of a Multidimensional Space Lexicographically
Sorting is a basic problem in computer science, and the existing parallel algorithms for the problem are mostly confined to a linear space [2], [21], [22], [32]. In many applications, it is necessary to sort vertices of a multidimensional space lexicographically. In a personnel management database, for example, it is often necessary to sort persons by their names, professional titles, ages, and so on. The same is true in other databases, such as production management and other information management applications [27], [34]. This problem can be regarded as the problem of sorting vertices of a multidimensional space lexicographically. In this subsection, we give a constant time solution on BSR$^{+}$ for the problem. Note that $O(\log n)$ time is needed on CREW or CRCW for sorting in a linear space (see Corollary 4.5 of [21]). Let the $n$ vertices of a $k$-dimensional space ($k \ge 1$) be $V[1..n] = v[1..n, 1..k]$ and
then $V(i)$ is less than $V(j)$ lexicographically if and only if
denoted by $V(i) < V(j)$, is true. The constant time solution on BSR$^{+}$ is given as follows for the problem of sorting vertices of a multidimensional space lexicographically, where Index$[1..n]$ is the result such that the vertex given by Index$(i)$ is less than or equal to the vertex given by Index$(j)$ lexicographically when $i < j$.
Algorithm Sort $k$D;
BEGIN
  for $j := 1$ to $n$ dopar
    $R(j) := 0$;
    $R(j) := \sum 1 \mid V(i) < V(j)$;
    $s(j) := R(j) + (j - 1)/j$;
    $R(j) := \sum 1 \mid s(i) \le s(j)$;
    Index$(R(j)) := j$;
  pardo;
END.
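The ranking idea behind the lexicographic sort, counting for every vertex how many vertices precede it, can be checked with a sequential simulation. The following is a minimal sketch and not the authors' exact algorithm: the function name `sort_kd` is hypothetical, Python tuple comparison stands in for the lexicographic relational expression, and the tie-breaking term $(j - 1)/j$ for equal vertices is an assumption of the sketch.

```python
from fractions import Fraction

def sort_kd(vertices):
    """Return Index (1-based) listing the vertices in lexicographic order."""
    n = len(vertices)
    v = [tuple(p) for p in vertices]          # tuple comparison is lexicographic
    # r(j): number of vertices strictly smaller than vertex j (one counting BI).
    r = [sum(1 for i in range(n) if v[i] < v[j]) for j in range(n)]
    # Tie-break (assumed): add (j-1)/j, a distinct value in [0, 1), for 1-based j.
    s = [r[j] + Fraction(j, j + 1) for j in range(n)]
    # Final rank in 1..n: number of keys not exceeding s(j) (a second counting BI).
    rank = [sum(1 for i in range(n) if s[i] <= s[j]) for j in range(n)]
    index = [0] * n
    for j in range(n):
        index[rank[j] - 1] = j + 1
    return index

order = sort_kd([(2, 1), (1, 5), (2, 1), (1, 3)])
print(order)  # [4, 2, 1, 3]
```

Each list comprehension corresponds to one BI with sum as the reduction, so the number of instructions does not depend on $n$; the sequential loops only simulate what the processors do simultaneously.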
Three Counting Problems for a Relational Database
BSR$^{+}$ allows constant time solutions to many operations on a relational database that has $n$ records with $k$ attributes, such as counting, for each record, the number of records of which (1) at least $m$, (2) exactly $m$, or (3) at most $m$ attributes are equal to (smaller, or larger than) those of the record. Let Records$[1..n] = r[1..n, 1..k]$, $K = \{1, 2, \ldots, k\}$, and $A_k^m = \{B \mid B \subseteq K, |B| = m\}$. The BSR$^{+}$ constant time solutions to the three problems above are given in the following, where only the case of "smaller than" is considered (the cases of equal to and larger than are derived similarly). Note that the simpler problem of dominance counting (Section 6.5.1 of [21]) needs $O(\log n)$ time on a CREW or CRCW with $n$ processors (see Theorem 6.8 of [21]).
(1) BEGIN
      for $j := 1$ to $n$ dopar
        Number1$(j) := \sum 1 \mid \bigvee_{B \in A_k^m} \bigwedge_{s \in B} r(i, s) < r(j, s)$;
      pardo
    END.
(2) BEGIN
      for $j := 1$ to $n$ dopar
        Number2$(j) := \sum 1 \mid \bigvee_{B \in A_k^m} \left( \bigwedge_{s \in K - B} r(i, s) \ge r(j, s) \wedge \bigwedge_{s \in B} r(i, s) < r(j, s) \right)$;
      pardo
    END.
(3) BEGIN
      for $j := 1$ to $n$ dopar
        Number3$(j) := \sum 1 \mid \bigvee_{B \in A_k^m} \bigwedge_{s \in K - B} r(i, s) \ge r(j, s)$;
      pardo
    END.
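The three counts can be simulated sequentially by evaluating, for every pair of records, a selection condition built from all attribute subsets of size $m$. The following is a minimal sketch under stated assumptions: the function name `count_smaller` is hypothetical, and the reading of the three problems as "at least $m$", "exactly $m$", and "at most $m$" smaller attributes is an interpretation of the garbled source, not a confirmed statement of it.

```python
from itertools import combinations

def count_smaller(records, m):
    """Per record j, count records i with at least / exactly / at most m smaller attributes."""
    k = len(records[0])
    attrs = range(k)
    subsets = [set(b) for b in combinations(attrs, m)]   # all attribute sets of size m

    def at_least(ri, rj):
        return any(all(ri[s] < rj[s] for s in b) for b in subsets)

    def exactly(ri, rj):
        return any(all(ri[s] < rj[s] for s in b) and
                   all(ri[s] >= rj[s] for s in attrs if s not in b)
                   for b in subsets)

    def at_most(ri, rj):
        return any(all(ri[s] >= rj[s] for s in attrs if s not in b) for b in subsets)

    out = []
    for rj in records:
        # One simulated BI per test: sum 1 over all records i that satisfy it.
        out.append(tuple(sum(1 for ri in records if test(ri, rj))
                         for test in (at_least, exactly, at_most)))
    return out

counts = count_smaller([(1, 1), (2, 0), (0, 0), (3, 3)], m=1)
print(counts)  # [(2, 1, 3), (2, 2, 4), (0, 0, 4), (3, 0, 1)]
```

Since $k$ and $m$ are constants, the number of subsets and hence the length of each selection expression is constant, which is why a general relational expression suffices for a constant time solution.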
CONCLUSION
In this paper, the simplicity of using BSR$^{+}$ to describe parallel algorithms was demonstrated. An efficient implementation for the BI of BSR$^{+}$, the most powerful PRAM, was given based on an earlier implementation of BSR (BSR$^{k}$, implementation ii)). Since the circuit component in charge of Selection & Reduction in BSR implementation i) is the same as that of BSR implementation ii), another $O(n^2)$ implementation of BSR$^{+}$ can be obtained easily based on the BSR implementation i). However, obtaining an $O(n \log n)$ implementation of BSR$^{+}$ remains an open problem. Compared with the direct implementation of BSR$^{k}$, the indirect implementation of BSR$^{+}$ has the following properties:
1. Simplicity,
2. The logic operation for selection is not limited to AND, and
3. The terms for selection are unlimited.
