Abstract-DIRECT is a multiprocessor database machine designed and implemented at the University of Wisconsin. This paper describes our experiences with the implementation of DIRECT. We start with a brief overview of the original machine proposal and how it differs from what was actually implemented. We then describe the structure of the DIRECT software. This includes software on host computers that interfaces with the database machine; software on the back-end controller of DIRECT; and software executed by the query processors. In addition to describing the structure of the software we will attempt to motivate and justify its design and implementation. We also discuss a number of implementation issues (e.g., debugging of the code across several machines). We conclude the paper with a list of the "lessons" we have learned from this experience.
Britton-Lee, Inc. [6], CAFS from ICL Ltd-.
[1], and DIRECT. Since both the IDM 500 and CAFS are commercial database machines it is highly unlikely that we shall see any published information about the implementation experiences of either machine-in particular, a candid admission of the errors made and/or the success of methods used. Thus, we feel that a description of the implementation decisions that we made and a summary of our experiences will most certainly aid future database machine implementors in their work.
DIRECT has been operational since early in the Spring of 1980. Presently (Summer 1981) , users can interact with DIRECT through an INGRES [12] interface. DIRECT supports all the relational operators (as in INGRES) except aggregate operations and appends. Future plans for DIRECT include incorporation of aggregate operations, expansion of the hardware, and an empirical perfonnance evaluation.
In Section II, we present an overview of the design of DIRECT and describe the present implementation. Section III describes the design of the DIRECT software. In Section IV, we present the actual implementation techniques used.
Finally, our conclusions and a discussion of what we have learned by implementing DIRECT is contained in Section V.
II. DIRECT SYSTEM ARCHITECTURE

A. Background
The DIRECT project began in 1977 as a consequence of several loosely related events: experience with INGRES on a PDP 11/45, exposure to the original paper describing RAP [9] , and the acquisition of five LSI 11/03 processors through the National Science Foundation Research Equipment Program.
Our experiences with INGRES led us to feel that a backend database machine for INGRES (or an INGRES-like DBMS) could greatly enhance performance. While we felt that RAP could indeed serve as a back-end database machine and improve the performance of a DBMS it suffered from a number of problems. RAP.-1 (the version of RAP described in [9] ) has a single instruction stream, multiple data stream (SIMD) architecture. The primary implication of this organization is that only a single instruction can be executed at a time. Thus, performance improvements over a conventional DBMS can only be achieved through the use of intra-instruction parallelism. An additional, unrelated, problem is that the entire database in RAP.1 must reside on a fixed head disk. We felt that these two factors would severely limit the performance of a large-scale implementation of RAP. One of the goals of the design of DIRECT was to allow for the simultaneous execution of a number of instructions, possibly from different queries. We felt that such a capability is necessary in order for a database machine to support a high volume of transaction processing for a given time unit. Another design goal was to allow for the execution of queries on databases (or portion of databases) of arbitrary size.
B. Original DIRECTArchitecture
As originally conceived, DIRECT consisted of six main .components.
1) A number of host processors with which users interact. Each host provides a user interface and a number of data management functions (e.g., query compilation).
2) Some number of mass storage devices on which the database resides.
3) A set of processors (termed query processors) responsible for the execution of relational operations on the database. 4) A set of memory modules constructed from charge coupled device (CCD) chips that are used as a shared disk cache for the query processors.
5) A crossbar switch connecting the query processors to the CCD modules and the mass storage devices to the CCD modules.
6) A back-end controller responsible for communication with the hosts and controlling the query processors, CCD memory modules, and mass storage devices.
A diagram of a sample DIRECT configuration is shown in Fig. 1 .
User queries on a host are translated into a tree format, then compiled and forwarded to the back-end controller. (An example of a query in a tree format can be seen in Fig. 5 .) The back-end detennines the "optimal" number of processors that should be assigned to the query. Processors are assigned to an instruction (an instruction corresponds to a relational operator) when it becomes enabled, i.e., when its input data exists. Thus, leaf instructions in the compiled query tree are immediately, enabled. Nonleaf nodes must wait until their children have produced some or all of their output data.
DIRECT was designed to support both intra-and interinstruction concurrency. To allow for concurrency within an instruction each relation is divided into fixed size pages. During the execution of an instruction a query processor will operate on one page of a relation at a time. To ensure that each processor examines the correct subset of pages, assignment of pages to the processors is centralized and performed by the back-end. When a processor is ready to examine another page it requests the "next" page from the back-end. The back-end responds with the address of a CCD module containing a page to be examined. By maintaining various control tables in its memory the back-end can ensure the correct action of a number of query processors executing the same code in parallel.
Inter-instruction concurrency also requires management by the back-end. In this case care must be taken that at the end of the concurrent execution of two or more instructions the database is in a consistent state. Since each processor must request a cache address. from the back-end before it actually examines the page, consistency can be guaranteed by the back-end.
Another task for which the back-end is responsible is overseeing the transfer of data from the mass storage devices to the shared cache. This is done either in anticipation of a request or in response to one. Anticipatory paging in DIRECT can be effective because the reference strings are known in advance (a propery of the algorithms used). If indexes were used then the reference strings would be constructed dynamically, precluding anticipatory paging.
The shared cache consists of several CCD memory modules. Each module holds 16 Kbytes of data (the size of a page of a relation). The memory modules are connected to the query processors by a crossbar switch that has the following two properties.
1) Two or more processors can read the same cache frame (memory module) simultaneously.
2) Two processors can read or write two distinct cache frames simultaneously.
In DIRECT, the memory modules are the active units in a data transfer operation while the processors are passive. This means that a page stored in some cache frame can be broadcast to any number of processors. Details of the crossbar switch are in [4] .
A final comment about DIRECT is that its organization fails into the multiple instruction stream, multiple data stream (MIMD) category. Although DIRECT could operate in MIMD mode, and in past publications we have referred to it as an MIMD machine, it in fact operates as a reconfigurable multiple SIMD machine. That is, typically more than one processor is assigned to the execution of an instruction and at a given time instance more than one instruction will be active.
C. Current DIRECTArchitecture
The current DIRECT configuration is shown in Fig. 2 In a balanced system', at any given instant not all the processors will require access to the multiport memory.
III. DESIGN OF THE DIRECT SOFTWARE
In this section, we describe in some detail the host, the backend, and the query processor process structures. While the host software is a modified version of INGRES, the back-end software was designed to process relational queries with a maximum degree of parallelism. Each query processor has some resident code (a primitive kernel) and executes instruction packets received from the back-end. We describe how the code for the query processors is generated and how a query processor operates during the execution of a code packet. 3Logically, the disk is connected to both the bus and the multiport memory. Physically, a processor is used.
4In the text we shall refer to it as the control bus. The version of INGRES which formed the basis for the DIRECT host software runs as five processes as shown in Fig.  3 . The monitor process is responsible for interacting with the user. Once a query has been entered by a user, it is sent by the monitor to the parser process. This process translates the query, using information from the-schema, into a binary tree fonnat. If the query is not a utility query (e.g., print a relation) it is executed by' the decomposition and OVQP (one variable query processor) processes (see [13] for more details). Finally, utility commands (to print, create, destroy, etc. relations) are executed by the DBU (database utility) process.
Since query execution is conveniently isolated as two processes, implementing the host software for DIRECT involved simply eliminating the decomposition and OVQP processes. Fig. 4 
B. The Process Structure of the DIRECTBack-End Controller
Once the functions of the host had been clearly defined, we were able to assume that the back-end machine would receive queries compiled into "packets" of instructions. Upon receiving a packet, the back-end becomes responsible for controlling the packet execution and for returning the result of the query to the host. An analysis of the services the back-end controller would have to provide revealed three main functions: catalog management, packet and instruction scheduling, and memory management. The first function requires that the back-end execute utility programs such as create and destroy a relation, or print a relation. The second function consists of controlling the execution of a packet: determining which instructions in the packet are executable and allocating a certain number of available processors to enabled instructions. This function is similar to the role of a scheduler in an operating system. The third function is the management of the shared cache. Here, the back-end processor plays the role of virtual memory manager in an operating system.
To some extent, these functions are independent of each other and services in each of the above three categories could be provided simultaneously. Thus, there is room for some parallel activity in the back-end controller itself. We felt that by implementing each of the three functions as a separate process we would be able to measure the actual degree of parallelism possible in the back-end. Then, if additional hardware became available, we would move each process to a separate computer and exploit the parallelism. The three processes were named UTIL, PKT, and MEM, respectively.
By maintaining appropriate data structures (such as relation descriptors and task descriptors) and by exchanging synchronization messages among themselves and with the query proces-, sors, the three processes are able to supervise the query processors. In order to efficiently manage the limited amount of primary memory available on the back-end, each process was allocated a "heap" from which the data structures were al located dynamically as needed during execution of a query. 536 Unfortunately, C (the implementation language) did not provide us with adequate support and we had to implement the heap as a fixed size array and write our own heap management routines.
To illustrate the functions of the back-end processes more clearly we shall describe in detail the execution of the query shown in Fig. 5 . Some of the data structures and the main messages exchanged during the execution of this query are illustrated in Fig. 6 .7 Each message is shown as an arrow between the box representing the sending process and the box representing the receiving process. In addition, each arrow points to the data structure to be updated upon receipt of the message. We have also shown (in the case of the GET_PAGE and NEXT_PAGE messages) the paths followed across data structures to reach the desired information.
Relations R and S, in Fig. 5 , are permanent relations in the database. Both relations are restricted resulting in the temporary relations A and B, respectively. Relation A is projected and the result (relation C) is joined with relation B to form relation D. The query packet generated by the CIDI process (see Section 111-A) will be sent to the PKT process. PKT will first send a message to UTIL requesting the creation of the four temporary relations A, B, C, and D. Next, it will create a precedence matrix ("PPM" in Fig. 6 ) showing the dependencies between the instructions.
For each enabled instruction (one whose inputs are available) PKT will request statistics information from UTIL, such as relation size in pages. This information is used in determining the number of processors that should be allocated to the instruction. Since the projection and join depend on output generated by the two selections, only the selections are enabled. Thus, PKT only requests information about relations R and S. Upon receipt of the information both selections are ready to be executed. Ready instructions are placed on the READY_LIST. When a number of query processors (possibly less than the "optimal" number, and perhaps only one) become available, an instruction is removed from the READY_LIST, added to the EX_LIST, and initiated.' PKT sends an assignment message to each allocated processor over the control bus.
This message includes the compiled-code to be executed (see Section III-C). MEM is also informed by PKT each time a processor is assigned to an instruction.
MEM uses a data structure known as the processor control block ("PCB" in Fig. 6 ) to maintain information on the activity of each processor. Additional data structures, associated with executing instructions are maintained so that an instruction excution can be monitored.
A query processor executing an instruction requests pages to be examined from MEM. Such messages are exchanged over the control bus. There are three types of page requests (de7This figure is not complete. It is merely intended to sketch out some of the activities in the back-end to aid the reader in reading the text. A much used detailed figure (known as "the map") exists for internal documentation purposes. 8There are a number of considerations used in selecting which of the two selections should be initiated first. In this paper we do not concern ourselves with this question since it is peripheral to the main subject. scribed in more detail in [4] ). These are: NEXT_PAGE, GET-PAGE, and NEW-PAGE. A NEXT_PAGE request is used when a page of a relation is to be processed by a single processor, as in the selection operation. A processor requesting a NEXT-PAGE gets any page of the relation which has not yet been examined by another processor. A GET_PAGE request is used when a processor is to examine a specific page (for example, one part of a stream of pages). Finally, aNEW_PAGE request is used when a processor is about to write a new page of a relation.
Note that MEM is truly a virtual memory manager. Pages residing in the multiport memory are grouped by classes depending on the operator that is currently using them. Classes are ranked according to priority. In the event that all the pages in the cache belong to the same class a modified LRU algorithm is employed to pick a candidate to be swapped out. MEM responds to GET_PAGE and to NEXT_PAGE requests with the address of the page frame containing the page to be examined by the requesting processor. The processor then initiates a memory access to that frame in the multiport memory. In the event of a NEW_PAGE request, MEM must find an empty page frame in the multiport memory and send its address to the requesting processor. At times this may necessitate writing a page from one of the cache frames to disk. MEM also updates the data structures associated with the requesting processor and the instruction after each request.
When all of the relation pages have been examined MEM responds to NEXT_PAGE and GET_PAGE requests with an "end-of-relation" message. Upon receipt of this message a query processor flushes its output buffer by executing a NEW_PAGE and informs PKT that it is done. PKT, by examining a data structure associated with each instruction, can determine,when an instruction has completed (i.e., all the processors assigned to the instruction have terminated). At For the query of Fig. 5 , termination of the selection operation R-+A will enable the projection. (Note that the join is not enabled until both the second selection and the projection have terminated.) PKT requests statistics on relation A from UTIL. Upon receipt of the reply the projection is added to the READY-LIST. The subsequent actions taken by the backend processes during the execution of the projection are similar to those described above. When the projection terminates the join is enabled.
After the execution of the join instruction, PKT sends a message to UTIL requesting that it print the result relation (that is, send it to the host for printing) and destroy the temporary relations created for the execution of this packet. PKT also informs MEM of the termination of the query packet and both processes destroy the data structures used to manage the packet execution.
Soon after implementation began we decided to combine MEM and UTIL into a single process. Both processes manage and use portions of the schema. In order for the two processes to run separately each would have to inform the other of any changes made. Since UTIL was expected to modify the schema relatively infrequently compared to MEM it seemed that the majority of the messages from MEM to UTIL informing it of changes to the schema would be wasted. We felt that the overhead associated with sending these messages would far offset any gains due to concurrent execution of the two processes.
C. Code Generation
In addressing the question of how to execute queries on the query processors, we were faced with two options: we could either have an interpreter for the query tree in each query processor and interpret instructions; or we could compile (on the host) the query tree into machine language for the query processor and execute the queries directly. Having chosen the compiled approach, our next task was to decide precisely what routines should be resident in a query processor and what should be sent in a query packet. For example, because the restrict algorithm is essentially the same regardless of the relation referenced or the actual selection criterion, it would have been possible to include a restrict code "template" in the query processors. The template would consist of an outer loop which opened the relation and executed repeated NEXT-PAGE calls. An inner loop would sequence through the tuples of a page and apply the selection criteria. In this approach, all that need be sent from the host is a "compare" subroutine to apply the selection criteria specified in the query and a "copy" subroutine to move each selected tuple to the result page buffer (of course, we would also send an instruction header which identified the instruction, the relation, etc.). Similarly, there could be permanently resident join, aggregate, projection, . . ., templates.
We did adopt the template approach but rather than include templates for each possible instruction in all the query processors (which, again, would have required too much memory), we decided to combine the query-and relation-dependent compare and copy subroutines with the code templates on the host as part of query compilation. This saved buffer space in the query processors since we only need to reserve space for the largest code template (rather than all of them). It The main difference between a port and a pipe is that two or more unrelated processes can communicate over a port. Each port is "owned" by a single process which is the only process that can read from it." Any number of other processes can write to the port. We decided to use ports rather than pipes because they offered us the possibility of writing a single communication subsystem that could be used by all the processes on all the machines. In retrospect, this was a good decision because debugging the communication code proved to be a lengthy and cumbersome task.
B. Implementation Strategy
We began implementation before all the hardware was available: there were problems with the LSI 11/03 microcomputers that served as the query processors and the multiport memory was still under construction. Implementation of the back-end processes on the one hand and the host CIDI process on the other hand proceeded independently and in parallel.
Debugging of the DIRECT software occurred in three steps. In each step we configured a new system. These were known as FQRYP, QRYP, and CQRYP. In FQRYP both the query°3 See Section V for a discussion of some of their shortcomings. 1tActually, since in UNIX children processes inherit all their parents property, a child of a port owner could also read from that port. In practice, chaos would result were a programmer to use this capability.
processors and the multiport memory were "faked." Once DIRECT was operational under FQRYP we moved to QRYP in which actual processors were used for the query processors but the multiport memory was still faked. In the final stage, CQRYP, the processors as well as the multiport memory were used.
To fake a query processor we implemented it as a separate process which included the resident routines that would otherwise be present in the query processor's memory.12 Assignment messages from PKT (those that included code) were read into an array and were executed by jumping into a particular address in that array.
To fake the multiport memory, we allocated an array of vectors in MEM. Each vector consisted of 512 bytes and represented a page frame (the page size was reduced from 16 Kbytes to 512 bytes). The number of vectors corresponded to the number of page frames. Pages were read from the disk into the array. A query processor still sent a page request to MEM and received a reply containing a page frame number. Then, the processor initiated a data transfer from the fake multiport memory by requesting the contents of a specific page frame from MEM. MEM responded with another message that included the desired page. Although this procedure was cumbersome and added unncessary communication overhead, it made the move to CQRYP easier than it would have been had we combined the reply to the request message with the data transfer message.
The three systems still run. We recently used FQRYP to test code installed to perforn update operations.
C. Implementation Time Frame
Our work on modifying INGRES (restructuring the process structure and writing the code generator) required a little less than a man-year. We spent an intensive three month period writing and debugging FQRYP, altogether another man-year. The move from FQRYP to QRYP took 3 days-we had to redo some of the message handling code. While approximately 5 days of effort were required for the move -from QRYP to CQRYP, the actual time period was about 4 months during which several hardware bugs were discovered (and fixed).
It should be noted that one of the reasons the move from FQRYP to QRYP took such a short time is that the software for downloading the LSI microcomputers and using the hardware devices for interprocessor communication had already been developed by another group using the processors for research in distributed operating systems [11] .
To recap, use of emulation as a technique for implementing the system simplified debugging and allowed us to develop different parts of the system independently and in parallel. Also, the code written for the emulated system was usable on the hardware with trivial changes. We still use FQRYP every time a new addition is made to the system. We feel that this implementation technique was very successful because it enabled us to work in an environmenit where debugging was relatively easy and because no work had to be duplicated in mov2Actually, these routines were slightly different-they included more debugging features. 540 ing to the hardware. It should be noted that one reason we were able to follow this path is that all the computers used were of the same family.
V. WHAT HAVE WE LEARNED?
In this section we present what we feel to be the important lessons of our work on DIRECT in the past four years. We briefly discuss the lessons gained from a simulation of DI-RECT, undertaken early on in the project. We then describe some of the problems we encountered during the implementation. This is followed with an enumeration of the "valuable" lessons we have learned. We close with a list of programming tools that we would have liked to have had and a list of future activities.
A. Simulation
After the initial design of DIRECT was completed we undertook to implement a detailed simulation. Concurrently, we analyzed parallel algorithms for aRl the relational operators for a DIRECT-like architecture. Ideally, both of these efforts should have taken place before we began work on the implementation. However, various "economic" factors forced us to commence implementing the system before sufficient ground work had been done.
The primary purpose for implementing the simulation was to study a number of different strategies for allocating processors to tasks. What began as a small limited-scale project ended up as a major effort. Once the simulation was running we were indeed able to select a "best" processor allocation strategy (see [31. We also' found out that DIRECT (as described in [4]) suffers from a number of problems.
Most important of these is the high cost of controlling the execution of a query. Our simulation showed that except for small system configurations (up to 20 processors) the execution time of various query mixes was dominated by the time required by the back-end to process the control messages (e.g., NEXT_PAGE). Another problem is DIRECT's poor perfor. mance when executing selection-only queries. A page of the relation to be restricted must be moved from the mass storage device into the CCD memory and from there into a processor's memory before the selection operation can be applied. Since a selection requires a single scan of the data, most of the execution time is spent doing "useless" work (i.e., the I/O transfers). Yet another problem is that all data transfers between processors are performed at the page level through page frames in the CCD. Although this approach minimizes transfer overhead by grouping data in large units, it results in a significant amount of page fragmentation. In particular, we found that processors executing some operation will, in general, not output full pages. Thus, portions of the CCD memory remain unused.
The simulation proved very helpful during the actual implementation in a number of ways. For example, because of the fine level of detail simulated we had a clear idea of the different functions the back-end had to perform. This information was used in determining the process structure in the backend. We also learned what data structures were required in order to implement the needed functions. Finally, the simulation provided us with the capability of testing different memory management algorithms to be used by the MEM process.
B. Implementation Hassles DIRECT was implemented using the C language on the UNIX operating system. At the outset, we did not appreciate the difficulties of writing a large system. The lack of structure in C (as compared with Pascal) exacerbated the problem. Many simple coding bugs were found only through timeconsuming trial-and-error search or serendipity. A related problem was that our development machine, a PDP 11/40, had insufficient memory to run UNIX Version 7. Thus, we were unable to take advantage of the new debugging and development tools available under this system.
Our use of ports for interprocessor communication resulted in a number of unanticipated (and serious) problems. A process attempting to read an empty port is blocked until the port is filled. This would have been acceptable had there been a way to detect the presence of a newly arrived message. But there was not. Similarly, a process attempting to write to a full port is blocked until sufficient room for the message exists. Since, there were numerous processes writing to each port (for example, an unspecified number of CIDI processes, MEM, and all the query processors write to the PKT port), blocking could result as a side effect of writing to a full port. We were fearful that this could lead to deadlock. We therefore placed constraints on the number of outstanding messages that a process could have at any given time. While we convinced ourselves that deadlock will be prevented, this was achieved at aIpossible performance loss since each back-end process could not attend to a task until receiving a reply to an outstanding message.
An additional problem with ports is that large messages may be split into smaller packets. The intent of this feature is to prevent a single writer from monopolizing a port. Unfortunately, message splitting is not under control of the application and it is difficult to predict when it will occur. The result then, is that ports cannot really be treated as stream I/O devices. This greatly complicated the implementation of the communication code and was a source of numerous bugs.
One of the problems we suffered from was lack of coordination among ourselves during coding. Initially, each person worked individually on a separate process. However, later we began to combine our code. A typical example of the difficulties we encountered at that stage is when we discovered that in the MEM process the named constants TRUE, IN, and YES were used synonymously but defined differently.
C Valuable Lessons
We DIRECT now supports most of the data manipulation capabilities provided by INGRES. We will soon begin to measure its perfonnance. Instrumentation of the system will most likely yield some information that will lead to performance improvements. Unfortunately, the task of instrumenting DIRECT will not be easy because at coding time we did not put hooks into the code to enable us to do this. There are certain places that one can be sure to find performance problems (e.g., access to frequently used data structures). Nonetheless we regret not having had the foresight to organize the code better for this task.
A related problem is that of a test database. During our debugging runs we used a supplier-parts database. This database consists of nine relations, the largest of which has 1200 bytes. The DIRECT page size was reduced from 16 Kbytes to 512 bytes to test execution of an instruction by more than a single processor. Although it was necessary to use a small database for debugging, we now need a "large" database for testing performance. To our regret we have not been able to procure a database much larger than the supplier-parts database we have been using. One long-range goal is to compare the performance of DIRECT and INGRES in processing queries on large databases.
Our experiences with the customized hardware (the multiport memory and the various interfacing equipment) have led us to feel that in a university environment one should attempt to minimize the amount of custom designed hardware. Such hardware should be designed and constructed only after all other alternatives have been exhausted. It may be the case that a few years from now, when sophisticated VLSI design tools become available in universities that custom designed hardware may be easier to get.
One observation that we can make about the performance of DIRECT at this time is that messages between machines require between 10 and 15 ms. Other distributed systems researchers have found this cost to be approximately the same. The majority of this time is spent in the software. Although the cost of message handling in software can doubtlessly be reduced (through the use of microcode or even providing some hardware primitives) the cost of processing a message is likely to remain high. This is a significant problem, and one that seems to be steadfastly ignored by designers of distributed systems (including multiprocessor database machines).
We conclude with a description of those tools that would have aided system development. The ability to share data structures between processes would have simplified the implementation of the MEM, UTIL, and PKT processes and would have reduced the number of interprocess messages necessary to exchange information about the status of an executing instruction, for example. The most complex code in DIRECT (which has already been rewritten three times) is the code to handle interprocess messages. The complexity arises mainly because the Rand port mechanism splits messages at arbitrary places. A more flexible IPC facility which permits the process to specify never to split messages would have significantly simplified this section of code. Another desired feature is the ability to detect the presence of an incoming message rather than blocking when a read is attempted on an empty port. Finally, a more structured programming language with associated development system would have certainly facilitated the debugging task.
D. Future Plans
Although DIRECT is operational, it is by no means dead as a research project. Currently planned activities for the future include the incorporation of aggregates (including aggregate functions); instrumenting the code and making it more efficient; obtaining a "large" database and comparing the performance of DIRECT to that of INGRES; incorporating concurrency control and recovery; moving the back-end to a VAX (this entails switching from Version 6 UNIX to Version 7 and redoing the front-end since the INGRES process structure for VAX UNIX is different); and moving MEM/UTIL and PKT to separate processors.13
