In this paper four alternative strategies for assigning processors to queries in multiprocessor database machines are described and evaluated. The results demonstrate that SIMD database machines are indeed a poor design when their performance is compared with that of the three MIMD strategies presented.
INTRODUCTION
During the past several years we have been investigating the design and implementation of a database machine for the execution of relational algebra queries. The architecture of DIRECT can be found in [9, lo] . In [ll] we describe query organization and execution in DIRECT and introduce the problem of relation fragmentation and its impact on query execution time. In parallel with these efforts, we have been proceeding with the implementation of a prototype of DIRECT that became operational during the spring of 1980 .
An important problem which arises in multiple-processor database machines is that of processor assignment. If the database machine serves as a back end to a database system that simultaneously serves many users, there will generally be more than one query ready to be executed. Given that each query consists of several relational algebra operations (e.g., restricts, projects, joins, or update operations), there are several possible strategies for assigning processors to these queries for execution. In the following sections we first describe four alternative strategies and then compare and contrast each with respect to several performPermission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. This research was supported in part by the National Science Foundation under Grant MCS78-01721 and in part by the U. ante criteria. The best of these strategies is shown to be a strategy that is based on the application of data-flow machine techniques to the execution of relational algebra queries. This strategy for executing queries yields an average performance improvement of approximately 3: 1 over the SIMD strategy and 1.5: 1 over the other MIMD strategies. Furthermore, we also demonstrate that the impact of moving data pages between mass storage and a shared CCD cache is negligible for the data-flow query processing strategy.
The purpose of this paper is not, however, to compare the performance of DIRECT with a relational database management system on a conventional computer. Such a comparison for a SIMD version of DIRECT and several other database machines-DBC [3] , RAP [17] , CASSM [25] , associative disks [21] , and CAFS [2] -can be found in [13] . A comparison of RAP and a relational database system on a conventional computer is presented in [18] . We believe that this paper demonstrates that the performance of DIRECT is generally superior to that of SIMD architectures. However, the results presented in [ 131 demonstrate that for restrict-only queries, "processor-per-track" architectures perform better than SIMD database machines, which must move data from mass storage into a shared CCD cache before processing it.
Our investigation is restricted to retrieval operations. This was done for a number of reasons: (a) Update operations take a relatively short time to execute (compared to joins) and basically consist of a retrieval operation followed by modification of the selected tuples. (b) When considering update operations, one must provide a concurrency control mechanism (in our case a number of such mechanisms). We feel that while these mechanisms would impact performance in a very real way, they would serve to increase the complexity of the experiments and contribute considerably to hiding the salient features of the different strategies. We fully intend to consider the impact of update operations and concurrency control in the near future.
Section 2 contains an overview of DIRECT. In Section 3 four processor assignment strategies are introduced and described in detail. Section 4 describes the experiments that were conducted to compare the performance of each strategy. The results of these experiments are presented in Section 5.
OVERVIEW OF DIRECT
In this section the hardware and software characteristics of DIRECT are reviewed. For more details the reader is encouraged to examine references [lo] and [ll].
Hardware
The hardware used to implement DIRECT consists of six main components: a host processor, a back-end controller, a set of processors for executing queries, a set of CCD memory modules which are used as pseudoassociative memories, an interconnection matrix between the set of processors and the set of CCD memory modules, and one or more mass storage devices. A diagram of these components and their interconnections can be found in Figure 1 .
The host processor handles all communications with the users. A user who wishes to use DIRECT logs onto a modified version of INGRES [24] and proceeds in the normal manner. When the user executes a query, INGRES compiles the query into a "query packet" and sends it to the back end for execution. Each query packet is comprised of one or more relational algebra operations (instructions) which are organized in the form of a tree. Each node corresponds to a relational algebra operation (actually update operations may be decomposed into a number of nodes) and is in a form that can be executed directly by an arbitrary number of processors without interpretation. Figure 2 contains an example of a query tree for an INGRES query that contains three restrict operations and two join operations. A, B, and C are the source relations referenced by the query. Relations A', B', C', and D are intermediate relations produced during the execution of the query. Relation E contains the result of the query.
The back-end controller is responsible for interacting with the host processor and controlling the processors. After it receives a query packet from the host, it determines the number of processors that should be assigned to execute the packet or an instruction within the packet, and then, as processors become available, assigns them to instructions.
Each processor is a PDP LSI-11/03 with 28K words of memory. The function of each processor is to execute instructions assigned by the back-end controller.
Since DIRECT is a virtual memory database machine, each relation in the database is divided into fixed-size pages. A set of CCD memory modules acts as a disk cache for the mass storage devices and is used by the processors for relation storage during query execution. Each CCD memory module corresponds to a page frame and can hold one page of a relation. The cache is managed using a modified LRU algorithm.
The processors are connected to the CCD memory modules through an interconnection matrix [9, lo] , which has two properties important for supporting both intra-and interquery concurrency. The first property is that two or more processors, each executing the same or perhaps different queries, can simultaneously read the same CCD memory module. The second property is that two or In this section we describe those features of DIRECT that permit alternative strategies for the assignment of processors to instructions for execution. DIRECT is not an SIMD (Single Instruction stream, Multiple Data stream) machine. Different processors can simultaneously be executing different query packets, different instructions from the same packet, or different instructions from different packets. Furthermore, the number of processors assigned to an instruction can be increased or decreased during the execution of the instruction without modifying or recompiling the instruction.
2.2.1 Join Algorithm. In a relational database system one of the most timeconsuming operations that must be performed is the join operation. In [4] , several alternative join algorithms for uniprocessor systems are presented and analyzed. The results show that in the absence of indices (as in DIRECT), a sort-merge algorithm performs best. However, for multiprocessor systems the nested-loops algorithm appears to be superior.
For multiprocessor systems, the nested-loops algorithm works by joining each unit of one (the outer) relation with all of the units in the other (the inner) relation. If a unit corresponds to a page, the outer relation is n pages long, and there are n processors available; then each processor can join one page of the outer relation with the entire inner relation. Therefore, the execution time will be l/nth the time required for a single processor to execute the join. The multiprocessor sort-merge algorithm employs a parallel sort of both relations on the joining attribute. This is followed by a uniprocessor merge on the joining attribute to perform the join.
Intuitively, the nested-loops algorithm should outperform the sort-merge since the amount of parallelism that can be attained is high (limited only by the number of pages in the outer relation) and can be maintained throughout the duration of the execution. When sorting in parallel, one may be able to start with a large number of processors. After each stage, however, the number of processors decreases, whereas the amount of data examined by each processor increases until in the final stage one processor must examine the relation in its entirety. The general superiority of the parallel nested-loops algorithm has been verified by a performance analysis [5] .
The processing of joins using the nested-loops algorithm and the ability to increase or decrease the number of processors assigned to an instruction requires that DIRECT be able to (1) Supply a processor with all pages of the inner relation in sequence. This is the function of the GET-PAGE operator described below. (2) Keep track of which page of a relation (outer relation in the case of a join instruction) a free processor should be assigned to examine, so that the processor does not examine a page that has already been given to another processor. To perform this function, the NEXT-PAGE operator and the Query Packet Task Table are Figure 3 , which implements the restrict operator. It is very important to notice that this procedure can be executed by any number of processors simultaneously. Each processor, through the use of the NEXT-PAGE operator, will work on a distinct subset of the pages of relation relA. Furthermore, it is possible to add additional processors after the restrict is already partially executed. In addition, a processor can be removed before an operation is completed by having the back-end controller send it a fake "end-of-relation" signal in response to a NEXT-PAGE request. The processor will respond by first flushing its output buffer into a CCD memory module and then requesting a new instruction from the back-end controller. Task Table. Since a query packet (or an instruction from a query packet) can be assigned to any number of processors, there must be a way to prevent two simultaneous NEXT-PAGE requests from different processors executing the same instruction from getting the same page. This is handled by the use of the query packet task table, which is shown in Figure 4 . The query packet task table has one entry for each instance of each relation referenced by each executing query. Associated with each entry is a monitor [15] Figure 5 and then increments the pointer by one. When the value of the currency pointer exceeds the size of the relation, all subsequent NEXT-PAGE operations on that entry receive an "end-of-relation"
Query Packet
value. This ensures that processors executing the same relational algebra operation will examine different pages of the source relation.
GET-PAGE
Operation. The GET-PAGE operator is a request from a processor to the back-end controller for the address of the CCD memory module which contains a particular page of a relation. No monitor is needed to coordinate GET-PAGE requests. Figure 5 illustrates the use of this operator to perform a join of relations A and B.
2.2.5 Parallel Update Operations. The following algorithms are employed by DIRECT for the three update operators append, delete, and replace. Deletes are implemented as negated restricts. To perform an append, we execute a variation of a merge. We assume that all pages are sorted on either the key or the entire tuple. The tuples to be appended are placed in as few pages as possible. Each page is then sorted. Each processor executing the append is given a page of the original relation and, in turn, all the new pages. The processor examines both pages and upon finding a duplicate tuple, it deletes the tuple from the old page. Finally, the new pages are added to the relation page table. Replace is implemented as a modified delete followed by an append. In each case the pages of the result relation remain sorted. Each relation must undergo a periodic reorganization if pages become too sparse.
Relation Fragmentation
and the Compress Operator. As discussed in [ll], a problem arises when more than one processor is used to select from a relation those tuples that satisfy a search condition. We call this problem "relation fragmentation." To illustrate, we will use the restrict operator. Assume that the relation being restricted is divided into fixed-size pages and that there is one processor per page. Then, when the search condition is applied by all processors in parallel to the relation, each processor will produce a subset of the new relation. This new relation will contain those tuples that satisfy the restriction. In DIRECT each processor will produce some fraction of a page of the new relation. If this new relation represents the results of the query, this fragmentation is not a significant problem. However, if this new relation is to be used by a subsequent operator (such as a join) in the query, then the degree of relation fragmentation will have a significant impact on the performance of this operator. This performance degradation will occur because any subsequent operator that uses the fragmented relation as an operand will have to read all of the partially filled pages in order to access all the tuples of the intermediate relation.
To solve this problem, DIRECT employs an operator called compression. The function of this operator is to read partially filled pages of a relation and output full ones. In [ll] , we analyze the use of this operator and develop a technique for estimating the optimal number of processors to perform the compression so that the benefits of its employment are maximized and the cost minimized.
QUERY PROCESSOR ASSIGNMENT STRATEGIES
In this section we describe and analyze four alternative strategies for the assignment of processors to queries. These strategies appear to be viable for any MIMD database machine. As we demonstrate in Section 5, each strategy provides a different level of system performance in terms of the execution time of a set of benchmarks that are described in Section 4.
SIMD Assignment
One of the original design objectives of DIRECT was to avoid the SIMD nature of earlier database machines such as RAP. In the SIMD assignment strategy, all processors are assigned to execute the same instruction from a single query simultaneously.
When the current instruction has terminated, the back-end controller assigns the next instruction from the same query packet to all processors. This continues until the packet has terminated, at which point the controller selects the next query packet to execute.
The flaws of this strategy are fairly obvious. If, for example, the size of a source relation for some instruction is 10 pages and there are 100 processors available, then 90 processors will be idle during the execution of the instruction.
Packet-Level Assignment
In this strategy, when the back-end controller decides to execute a query packet, it examines the packet and estimates the "optimal" number of processors to assign to it. Our heuristic estimation uses the number and type of operators in the packet as well as the number and size of the source relations referenced by the instructions in the packet. Once this value has been estimated, it remains fixed throughout the execution of the query.
After the back-end controller estimates how many processors should be assigned to a packet, it selects an executable (enabled) instruction from the packet. An instruction is executable when its input relation(s) exist. Clearly, if the query is in a tree format, all leaf nodes are immediately executable. A node higher up in the tree is enabled whenever all of its descendents have finished executing.
Let QPS represent an estimate of the "optimal" number of processors to be assigned to the query packet. If the instruction selected for execution is a restrict operation, then the controller will assign MIN( IS, I, QPS) processors to the instruction, where 1 Si 1 is the number of pages in S,, the source relation to be restricted. If the operation is a join of relations Si and S;, then MIN(MAX( 1 Si I, I Sj I), QPS) processors are assigned. Selecting the larger of the two relations Si and Sj means that, if MAX( 1 Si I, ) Sj ) ) 5 QPS, each processor will join one page of the larger relation with every page of the smaller relation. This approach maximizes the degree of concurrency and hence minimizes execution time.
At this point, if all the processors assigned to the packet have not been utilized, the next executable instruction from the packet is initiated. This continues until either all the processors assigned to the packet are executing some instruction from the packet, or until no more executable instructions are available (since their inputs have not been generated). If no more executable instructions exist, the available processors are placed on an idle list associated with the packet until an instruction is enabled. Idle processors are not reassigned to another packet before the packet terminates.
If, when the first instruction in a packet is initiated, the number of processors available is less than QPS, the packet is still initiated. Then, when another executing packet terminates, the additional processors are assigned to the suboptimal packet. At this point the processors will either be assigned to an instruction that is below its optimal level or, if no suboptimal instructions exist, to another instruction in the packet that is executable but has not yet been initiated.
3.3
Instruction-Level Assignment
For this strategy, scheduling and processor assignment is performed on an instruction-by-instruction basis. The optimal number of processors assigned to an individual restrict or join instruction is limited only by the number of processors available. If the total number of processors available is MAXQPS, then for a restrict operations QPS = MIN (I Si I, MAXQPS) and for a join QPS = MIN(MAX( I Si I, I Sj I), MAXQPS). When a processor becomes idle, the back-end controller first attempts to assign the processor to any executing instruction that does not have its optimal number of processors. If no suboptimal instructions exist, the processor is assigned to an enabled instruction from a query packet that is currently being executed. If there are no enabled instructions from the currently executing packets, then a new packet is initiated. If there are no packets awaiting execution, then the processor is placed on an idle list until a new packet arrives from the host or an instruction is enabled.
Data-Flow Assignment
This strategy is a variation of the instruction-level assignment strategy, except for the criterion that is used to enable instructions in the query packet tree. In the following sections we first explain the basic concepts of data-flow machines, and then we describe processor assignment and query execution in the data-flow assignment strategy.
3.4.1 Data-Flow Machines. A data-flow machine is an architecture devoid of a program counter where instructions are enabled for execution as soon as their operands are present. Such a machine consists of a memory section, a processing section, and an interconnection device between the memory section and the processing section. A memory cell contains an instruction and room for the operand data. As soon as all the required data are present, the contents of the cell are sent to some processor for execution. This frees the cell for the execution of the next instruction. Output from the execution of an instruction is sent via the interconnection device to one or more memory cells, possibly enabling one or more instruction(s) in the destination cell(s).
Data-flow architectures seem particularly attractive when we consider the fact that many programs contain a significant degree of inherent parallelism that cannot always be exploited by conventional machines or even multiprocessors. Furthermore, in order to exploit the parallelism, extensive analysis of the program must be performed by a compiler. In a data-flow machine this is unnecessary since the notion of sequentiality in the execution of programs does not exist.
Various architectures for data-flow machines have been proposed [l, 7, 8, 191 . These architectures differ from each other in many ways. One difference is the granularity of the operands and the types of operations that the processors execute. For example, Dennis and Misunas [8] talk about assigning such instructions as add and multiply to the processors, whereas Rumbaugh [19] assigns entire procedures to processors.
For data-flow database machines there are also several alternative variable granularities for enabling relational algebra operators in the query tree. That is, the basic variable on which processors operate can be the whole relation, a fragment of a relation, or a single tuple.
Thus the strategy described in Section 3.3 is basically data flow. However, because of the memory unit size and the processor capabilities in DIRECT, any advantage claimed for a data-flow strategy will not show. A strategy using a fragment of a relation-in this case a physical page-is the fourth strategy we investigated. Its description is presented below.
Data-Flow Assignment Using Page-Level Granularity.
In this strategy a page of a relation is the basic unit on which processors operate and which is used for scheduling decisions. This means that an instruction can be initiated as soon as at least one page of each participating relation exists. Assigning processors to operate on pages rather than relations offers the possibility of having a very flexible processor allocation strategy. Furthermore, it becomes possible to distribute processors across all nodes of the query tree and to pipeline pages of intermediate relations between them. This will reduce page traffic between the CCD memory and the mass storage device(s) because soon after a page of an intermediate relation is produced by one processor it will be consumed by a processor executing the subsequent operator.
The processing of queries in a data-flow fashion is related to the idea of processing relational queries in a pipelined fashion, which has been suggested by ACM Smith and Chang [22] and Yao [26] . There are, however, several important differences between the two strategies. In the pipelined strategy there will be at most one processor executing each node in the tree; therefore, the concurrency obtained will be limited by the number of nodes in the query tree. With the dataflow strategy we can have any number of processors executing each node and can dynamically adjust which processors are executing which nodes in the query tree in order to maximize performance. The other major difference is that in the dataflow strategy we never need to wait for one node to completely finish before initiating the subsequent operator, as has been suggested is necessary for pipelining [26] . We clearly want to allocate a large number of processors to the restrict producing the outer relation of the subsequent join. This will cause a large number of output pages to be produced. Therefore a high level of concurrency can be attained in the execution of the subsequent join by assigning a processor to each page of the outer relation. Each will then join its page with the entire inner relation.
On the other hand, since each page of the outer relation will be joined with the entire inner relation, the inner relation will be resident in the CCD memory for the duration of the join and, consequently, should occupy as few pages as possible. In the SIMD, packet-based, and instruction-level strategies a large number of processors can be allocated to the restrict producing the inner relation in order to minimize its execution time. Then, after both relations have been completely computed, the inner relation is selected and compressed to a manageable size using the compress operator.
Instead of using the compress operator with the data-flow strategy, we employ the following technique for generating the inner relation. Since we want pages of the inner relation to be consumed by the processors executing the join fairly soon after they are produced (so that they are not paged out of CCD memory), and since joining two pages takes considerably longer than restricting one page, we want pages of the inner relation to be produced at a slower rate than pages of the outer relation. Therefore, by allocating a small number of processors to this restrict, we can achieve the goal of producing a small number of relatively full pages without employing the compress operator.
DESIGN OF AN EXPERIMENT TO COMPARE THE FOUR STRATEGIES
In order to evaluate the four processor assignment strategies and to uncover any unforeseen problems with the design of DIRECT, a detailed simulation of DIRECT was implemented. (Actually there are four variations, one for each strategy. However, since the differences are minor, we subsequently refer to them as one.) This simulation permits us to make detailed performance measurements on all aspects of the system, including both resource utilization (processors, CCD memory modules, and secondary memory) and software performance (query execution and memory management). In Section 4.1, the hardware and software characteristics we assumed are described. Then, in Section 4.2, the set of tests used to evaluate the different processor assignment strategies are outlined. (1) the time required to transfer a page between a CCD memory module and the local memory of a processor; (2) the time required to determine whether a tuple from a relation satisifes a selection criterion; (3) the time required to join two tuples.
The time to transfer a page between a processor's main memory and a CCD memory module was assumed to be 33 milliseconds, based on an LSI-11 Q bus bandwidth of 0.5. lo6 bytes per second [12] and a page size of 16K bytes.
The following restrict algorithm is employed by a processor on a page of a relation. First, a pointer is set to the appropriate attribute of the next tuple. Since all tuples in a given DIRECT relation have the same fixed length, this operation takes a constant amount of time. Next, the attribute is compared against the restrict criterion. We assume that each attribute is a character string and that if the tuple does not satisfy the restrict criterion, three-tenths of the characters in the attribute are examined before a match failure occurs. Otherwise, the full attribute is examined. When a tuple satisfies the search condition, the desired attributes are moved character by character to an output buffer in the processor's main memory.
A processor assigned to join two pages uses a sort-merge algorithm [4] . We assume that, on the average, three tenths of the characters of the joining attributes from both tuples are compared before a failure is determined. If the join is successful, the desired attributes are selected and moved to an output buffer. If the output page is to be used by a subsequent operator in the query, then the processor sorts the page on the attribute the subsequent operator uses before writing it to the CCD memory. By having each processor sort its output pages (instead of sorting both its input pages), each page of an intermediate relation is only sorted once.
CCD Memory Modules and Interconnection
Matrix. The bandwidth of an individual memory module was assumed to be 2 megabytes per second, based on INTEL 2314 CCD chips. This implies that a 16K byte page could be transferred into (from) a CCD memory module in 8.2 milliseconds if the transfer rate is not limited by the LSI-11 Q bus or the disk transfer rate (see Section 4.1.3). Furthermore, we assume that the interconnection matrix does not impact CCD memory performance (see [9] , [lo]).
Each of the four simulations uses identical algorithms for managing the set of CCD memory modules. The algorithms represent a combination of classical memory management techniques (e.g., use of dirty bits, least recently used bits) along with techniques that exploit the reference string generated by the parallel nested-loops join algorithm. For example, the CCD management algorithm avoids, if at all possible, ejecting the page that is being used as the inner relation in a join currently being executed until all processors that are executing the join have seen the page. 4.1.3 Muss Storage Devices. Since DIRECT is a virtual memory machine, pages of relations that are not being referenced by an active query packet are resident on one or more mass storage devices. When a page is subsequently referenced, it is loaded into a CCD memory module. IBM 3330 disks were used as the model for our mass storage devices. The transfer time for a 16K byte page is 20 milliseconds. The time to seek one track is 0.148 millisecond and the latency time is 8.4 milliseconds. We assume that there are two disks available for relation storage and swapping.
The simulation services disk requests in a first-come-first-served order. Therefore, when a processor issues a request for a page of a relation, the length of time required to satisfy the request is dependent on the other disk activity. The directory structure used in the simulation models each relation as occupying a set of adjacent tracks on one disk. The pages of a relation are first spread across all tracks in one cylinder before the next cylinder is used in order to minimize seek times.
Experiment Design
The database used to evaluate the four processor assignment strategies consists of 15 relations. The size (in pages) of each relation was randomly chosen from an exponential distribution with a mean size of 23 pages, minimum relation size of 1 page, and a maximum relation size of 100 pages. The tuple length of each relation was chosen from an exponential distribution with a mean size of 55 bytes, a minimum size of 10 bytes, and a maximum size of 100 bytes. For each restrict operator the fraction of tuples that satisfy the restrict condition was chosen from an exponential distribution with minimum of 0, maximum of 1, and mean of 0.125 (chosen to produce result relations with a reasonable size). For each join the fraction of tuples (of the product of the number of tuples in both pages) that satisfy the join qualification was selected on the basis of an exponential distribution with a minimum of 0, maximum of 1, and a mean of 0.0035 (again chosen to produce reasonably sized result relations).
Six different sets of queries were chosen to evaluate the alternative processor allocation strategies. Classes I-IV each contain five query packets and correspond to a range of overhead-intensive queries (Class I) to execution-intensive queries (Class IV) [14] . The tests Mix I and Mix II each contain 10 query packets and represent what we feel are a reasonable mix of the different classes of queries. Table I summarizes these six different experiments.
It is felt that Classes II and III contain the types of queries that are typically performed by users in accessing relational databases. This is why Mix I and Mix II include a high percentage of queries from these two classes. If views are supported by the relational database system and queries are modified according to the view, it has been observed [16] that it is not unusual for a modified query to contain five to seven join operations. Therefore, the results of Class IV may be as significant as the results of Mix I and Mix II, because Class IV contains a large number of joins in each query packet.
EXPERIMENT RESULTS

Establishment of a CCD Memory Module to Processor Ratio
The first test performed served two functions. Its primary function was to establish an appropriate ratio of CCD memory modules to query processors. Once this value was established it would be used in all subsequent tests. The second function of this experiment was to determine how the performance of each processor assignment strategy is affected by this ratio. We felt that this should provide some indication about how efficiently each strategy uses the CCD memory modules available.
To perform this experiment, we fixed the number of processors available at 50 and then ran test Mix I on all strategies for five different CCD memory sizes: 50, 100, 150, 200, and 250. Figure 6 contains the results of this experiment. As the reader will notice, the performance of the SIMD, packet-level, and instructionlevel strategies continues to improve as the number of CCD memory modules increases. If the number of CCD modules is increased beyond 250, this trend continues until enough modules are present so that pages from source, intermediate, and final relations never have to be ejected to secondary memory. The data-flow strategy, on the other hand, is not significantly affected by the number of CCD memory modules present. This result seems to indicate that this strategy indeed succeeds at using pages from intermediate relations before they are paged out. This saves a write operation to mass storage followed by a read operation when an intermediate relation is subsequently accessed.
We feel that the reason that the packet-level strategy is less affected by an increase in the number of modules available than the inntruction-level strategy is that it is less flexible and hence has better locality properties than the instruction-level strategy. While this argument should also apply to the SIMD strategy, the results indicate otherwise. Therefore, there may be another, yet undiscovered, reason why the packet-level strategy behaves this way. Although not presented, similar results were obtained at several other levels of processors and for other tests. Thus, as a compromise between the thriftiness of the data-flow strategy and the greediness of the SIMD, packet-level, and instruction-level strategies, we chose a CCD memory module to processor ratio of 2: 1. This ratio was used in all subsequent tests. Therefore, in the results presented below, if there are n processors available there are 2*n CCD memory modules available.
Analysis of the Simulation Results
Using this 2: 1 ratio, each of the six tests (Classes I through IV and Mixes I and II) was executed using each alternative strategy for a range of available processors We interpret these graphs below. When examining these graphs, the reader should be aware that the schedule of packets and instructions for any given processor level is not necessarily an optimal schedule. At any given point, when the back-end controller makes a decision regarding which instruction a processor should be assigned to or which page should be ejected from CCD memory, it chooses the "best" option. This local (immediate) optimization does not always produce an optimal schedule. Consequently, certain anomalies can occur in the results. remains constant as the number of processors is increased from 30 to 40. It is simply the case that the back-end controller made a decision that turned out in the long run to be a bad decision. One obvious result from these experiments is that the SIMD strategy always performs significantly poorer than all the other strategies. Our interpretation of this is that any "caching" database machine (a database machine that first must move data from mass storage to a faster memory before processing can begin) that does not employ one of the MIMD strategies is most likely a poor design.
The principal result is the superiority of the data-flow strategy for processor allocation. If Mix I and Mix II are taken to be representative of typical query mixes, then, for a given number of processors and CCD memory modules, the data-flow strategy is approximately 3 times as fast as the SIMD strategy, 1.7 times as fast as the packet-level strategy, and 1.3 times as fast as the instructionlevel strategy.
Initially the relative performance of the data-flow and instruction-level strategies on the tests Mix I and Mix II was somewhat puzzling because the data-flow is only marginally superior to the instruction-level for Classes II and III (which comprise a significant portion of the tests Mix I and II). As an explanation we hypothesized that the data-flow strategy, because it is an advanced form of pipelining, tends to utilize the CCD cache more efficiently (this hypothesis is partially verified by the results presented in Figure 6 ). Further reflection on the problem led to the observation that the queries in Class II are not as likely to benefit from the effects of pipelining, since each query contains only three operators. However, the performance of Class III, in which each query has five or seven operators and consequently should benefit from the pipelining characteristics of the data-flow strategy, seemed to contradict the hypothesis. In an attempt to verify the hypothesis we tried two experiments. The first was to increase the number of queries in Class III from five to ten (we also enlarged the number of relations in the database so that effects of two queries referencing the same relation were minimized). For this variation of test Class III, the data-flow strategy was 11.6 percent faster than the instruction-level strategy (in the original Class III, the data-flow strategy was only 2.3 percent faster). This seems to indicate that the original Class III test simply did not generate enough CCD activity for the benefits of pipelining to appear. The second experiment we conducted to test this hypothesis is presented in the following subsection.
A surprising result -is the relatively good performance of the packet-level strategy when compared with the instruction-level strategy. If we assume that Mixes I and II are representative test sets, the performance of the instructionlevel strategy is 24 percent better than that of the packet-level strategy for Mix identical because each query packet contains only one instruction. We feel that the relatively high level of performance demonstrated by the packet-level strategy occurs because it induces less thrashing of data between the CCD memory and mass storage. In the packet-level strategy, as a processor finishes executing an instruction, it is either reassigned to another instruction in the same packet or is left idle until an instruction in the packet is enabled. Under identical conditions the instruction-level strategy may assign the free processor to an instruction from another packet or even initiate a new packet. Executing this new instruction will probably result in pages from secondary memory replacing pages currently in the CCD memory. As a consequence, when an instruction from the original packet is finally enabled, its operands will most likely have been paged out. 
Effect of Swapping on Performance
The next experiment we conducted was designed to determine why the data-flow strategy was superior to the instruction-level strategy for Mixes I and II and at the same time to measure the impact that swapping pages between CCD memory modules and secondary memory has on query execution time. We felt that this was a very significant experiment because it has been argued that database machines that use paging will always be I/O bound [23] .
To determine the effect of swapping on system performance, we modified the simulations for the data-flow and instruction-level strategies so that the time to transfer a page between a CCD memory module and a disk is 0 millisecond. In I I  I I  I I  I I  I I  I I  I I  I I  20 20  30 30  40 40  50 50  60 60  70 70  80 80  90 90  100 100 NUMBER OF PRBCESSBRS NUMBER OF PRBCESSBRS this way it appears that the bandwidth of the channel and disk is infinite. The size of the CCD memory is still limited, however, by the 2: 1 ratio. Then we reran test Mix I for the data-flow and instruction-level strategies. Figures 13 and 14 present the results of this experiment.
For the instruction-level strategy ( Figure 13 ) the improvement, averaged over all processor levels, is 39.5 percent. Thus swapping has the effect of decreasing system throughput slightly more than one-third. While significant, we feel that the overhead of swapping is not as high as expected. For the data-flow strategy the improvement averaged over all processors levels is 18.4 percent. This figure clearly indicates that virtual memory database machines can be organized in such a way as to avoid being I/O bound as has been claimed.
The difference in improvement shown by the two strategies in the "infinite disk" case is 21 percent. This value is approximately the same as the difference between the instruction-level and data-flow strategies for the tests Mix I and II, as shown in Figures 11 and 12 and seems to explain the primary reason why the data-flow strategy is superior.
Effect of Database Size and Query Processor Speed
The final two experiments attempted to determine the sensitivity of the results we have presented to the two parameters that we viewed as crucial: database size and instruction execution time of the processor.
To determine the sensitivity of the instruction-level and data-flow strategies to the size of the database being accessed, we modified test Mix I by doubling the size of each relation in the database. On the average, the data-flow strategy is 22.1 percent faster than the instruction-level strategy. Somewhat better results were obtained with the other tests on this and other enlarged (enlarged in both the number of relations and the average size of each relation) databases. In general, as the database referenced increases in size, the percentage improvement of the data-flow strategy over the other strategies tends to increase for most tests (see Section 5.2 for another example of the impact of database size). Again, we feel that the reason for this increase is the improved CCD management exhibited by the data-flow strategy.
The second sensitivity experiment we performed was to double the speed of the processor by cutting in one-half the execution time of each instruction and doubling the DMA transfer rate. For Mix I, on the average the data-flow strategy was 35.5 percent faster than the instruction-level strategy (an increase of 11 percent over the difference between the two strategies for Mix I with the normal processor speeds). This result seems to demonstrate again the effect of pipelining in the data-flow strategy. However, compared to the results with normal processor speeds, the execution time decreased by only 21 percent for the data-flow strategy and only 6 percent for the instruction-level strategy. This implies that given the present hardware components performances, increasing the performance of one component (in this case, the processor speed) does not mean that system performance will increase by the same factor.
Message Activity
While we have not yet modeled the back-end controller requirements for each strategy, we have modeled the message activity of each strategy. Each operator that is sent to a processor is counted as one message. In the SIMD strategy, the distribution of the operator to all processors is counted as only one message since we assume that such a database machine would have a broadcast capability. Each relation page request executed by a processor is counted as three messages: one to the back-end controller to make the request, one to the processor from the controller containing the CCD memory module number, and another from the processor to signal that it has read (written) the memory module so that the controller can free the page frame. When "end-of-relation" is received on a NEXT-PAGE or GET-PAGE request (i.e., there are no more pages available), only two messages are exchanged.
In Figure 15 we have plotted the number of messages sent between the set of processors and the back-end controller for all strategies running test Mix I. As expected, the data-flow strategy has the highest message traffic. Reflection on the results provided an explanation. Each of the other three strategies uses the compress operator to solve the problem of relation fragmentation. The increase in the number of messages required to do the compress (the number of messages to do a compress is linear in the size of the relation to be compressed) is more * H. Boral and D. J. Dewitt than offset by the resulting decrease in the number of messages required during the execution of the subsequent join (the number of messages to do a join is quadratic in the size of the two relations to be joined). The data-flow strategy does not employ the compress operator but instead attempts to minimize the size of the inner relation by reducing the number of processors producing that relation. However, the typical inner relation size is generally two or three times larger than that of a compressed inner relation. Consequently, the subsequent join runs slower (because each processor must read several sparsely filled pages instead of one mostly full one), and the number of messages is higher.
CONCLUSIONS AND FUTURE RESEARCH
In this paper we have described and evaluated four alternative strategies for assigning processors to queries in multiprocessor database machines. As originally predicted in [9] , our results demonstrate that SIMD database machines are indeed a poor design when their performance is compared with that of all the MIMD strategies we have presented.
We have also introduced the application of data-flow machine techniques to the processing of relational algebra queries. Except for the number of messages required, the data-flow strategy is either superior or as good as all other strategies we examined. Furthermore, our results indicate that a two-level storage hierarchy (in which relations are paged between a shared data cache and mass storage) does not have a significant impact on performance if the data-flow query processing strategy is employed. Since the data-flow strategy is a generalization of the pipelined strategy, our results verify the expected performance of this strategy, as was anticipated by Smith and Chang [22] and Yao [26] .
One significant problem exposed by this research is the high level of message traffic activity. For all strategies the amount of message traffic that must be supported is relatively high. If 8000 back-end controller instructions are required to process each of the 7500 messages passed in executing Mix I for 10 processors (a figure derived from UNIX pipe code efficiency [6] ), then 60 million instructions will be executed just to process the messages. If each instruction takes 1 microsecond, then 60 seconds will be required to process the messages (60 seconds are also required to execute the queries for the data-flow strategy). It is important to notice that any decrease in query execution time by using additional processors may be offset by the increased time required to process messages in the back-end controller. For example, using the data-flow strategy and 50 processors, 27 seconds are required to execute the queries and 178 seconds will be required by the backend controller to process the messages.
There are three potential solutions for this problem. The simplest is to increase the page size. Increasing the page size by an order of magnitude should decrease message traffic by a similar factor. However, there may also be a decrease in the maximal degree of concurrency possible. Another solution is to attempt not to reduce the volume of messages but rather to reduce the cost of processing the messages by implementing message-handling software in microcode on the backend controller. This could help significantly. The ultimate solution is to decentralize the back-end controller. We have completed a preliminary design for a new data-flow machine architecture for processing relational algebra queries [S) and are currently engaged in the design of an architecture based on this organization in order to support the data-flow assignment strategy efficiently.
