The design trade-off decisions that affect these systems cannot be made in a vacuum. They strongly depend not only on the cost and performance of the hardware subsystems, but also on the pattern of use by the people who use the system.
The task of studying interactions of user demand on a variety of hardware architectures requires a unique modeling system. The model described here builds on the established technique of analytic queueing network modeling by applying both performance goal constraints and heuristic "grow" procedures. The resulting model permits facile description of an architecture as a collection of hardware subsystems where the exact number and speed of each subsystem are parameterized.
The described architecture can be subjected to a range of load conditions simulating different sizes and kinds of user organizations. By applying heuristic grow procedures, the model finds configurations (in-
analytic model fails to account for the spread of congestion from the finite queue of the overloaded node to its neighbors in the network. This congestion adds to the workload of the neighboring nodes, sometimes causing them to become overloaded as well. This kind of dependent behavior accounts for the long delays in response time encountered in a timesharing system undergoing "thrashing" due to the lack of a resource such as main memory. The analytic model will, however, pinpoint the node responsible for the situation.4
The CONFIG procedure (Figure 1) forms the heart of the program. Figure 2 shows the overall structure of the architectural model. During a run, one or more of the user organization parameters will be stepped through a set of values. For each set of user organization parameters, the CONFIG procedure will be run iteratively until the configuration either meets the response time criterion or cannot be expanded any further.
After each iteration of the queueing network model, the architectural model checks for convergence to the response time criterion programmed by the modeler. This study used a weighted average transaction response time of less than a constant set at 15 seconds (intended to represent the "frustration limit" of a human interactive user). The architectural model then searches for the node that contributes the most to weighted mean response time and calls a modeler-written heuristic routine to grow the capacity of that node. Finally, the model iterates again through the queueing network calculation. For configurations that converge, the model also computes a cost function written by the modeler.
The grow routine accepts an integer as input which gives the number of the worst-case node. The routine then returns a Boolean that indicates whether or not it is possible to grow the configuration to enhance the capacity of the node in question; if not, then the architecture has reached a fundamental limitation.
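The iteration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CONFIG procedure itself: the names `Node`, `residence_times`, and `config` are hypothetical, a single transaction class stands in for the weighted transaction mix, and a simple open-network approximation (each unit treated as an independent single-server queue with the load split evenly) replaces the full analytic queueing network model.
```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    demand_s: float  # service demand per transaction on one unit, in seconds
    units: int = 1   # identical units sharing the load evenly (assumption)


def residence_times(nodes: List[Node], arrival_rate: float) -> List[float]:
    """Per-node residence time; a saturated node is reported as infinite."""
    times = []
    for node in nodes:
        utilization = arrival_rate * node.demand_s / node.units
        if utilization < 1.0:
            times.append(node.demand_s / (1.0 - utilization))
        else:
            times.append(float("inf"))
    return times


def config(nodes: List[Node], arrival_rate: float, limit_s: float,
           grow: Callable[[int], bool],
           max_iterations: int = 1000) -> Optional[float]:
    """Iterate until the response time criterion is met or growth fails.

    `grow` receives the index of the worst-case node and returns True if it
    managed to add capacity, False if the architecture cannot grow further.
    """
    for _ in range(max_iterations):
        times = residence_times(nodes, arrival_rate)
        response = sum(times)
        if response < limit_s:
            return response  # converged
        worst = max(range(len(nodes)), key=lambda i: times[i])
        if not grow(worst):
            return None      # fundamental limitation reached
    return None


# Hypothetical two-node configuration and a trivial grow routine.
nodes = [Node("cpu", 0.5), Node("disk", 2.0)]


def grow(index: int) -> bool:
    if nodes[index].units >= 8:  # hypothetical upper bound on units
        return False
    nodes[index].units += 1
    return True


print(config(nodes, arrival_rate=1.0, limit_s=15.0, grow=grow))
```
In this made-up example the disk node is saturated at first, so the loop grows it until the summed residence time falls below the 15-second limit. A return value of `None` corresponds to the case in which the grow routine reports that the node cannot be expanded.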
The modeler must specify how to add capacity to a particular node by writing algorithms that increase one or more of the hardware parameters. These algorithms should thus be defined in terms that are natural to the architecture in question. For example, a string of disk drives grows by adding one drive at a time to the string, while the throughput of solid-state memories can best be increased by raising the degree of interleaving.
Adding capacity to a particular node may involve changing parameters that also affect other nodes. For example, adding a disk drive to a string may result in exceeding the number of drives that can be handled by a controller. Similarly, adding processing elements in a bus-coupled architecture may increase the bus's electrical length, thus prolonging its cycle time and reducing its capacity. Grow routines can easily reflect effects like these.
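A grow routine with such a side effect might look like the following Python sketch. The class `DiskSubsystem`, its field names, and the limits of eight drives per controller and four controllers are all hypothetical values chosen for illustration; they do not come from the study.
```python
from dataclasses import dataclass


@dataclass
class DiskSubsystem:
    drives: int = 1
    controllers: int = 1
    drives_per_controller: int = 8  # hypothetical controller limit
    max_controllers: int = 4        # hypothetical architectural limit

    def grow(self) -> bool:
        """Add one drive, adding a controller if the string outgrows the
        existing ones. Returns False when no further growth is possible."""
        # Ceiling division: controllers needed for one more drive.
        needed = -(-(self.drives + 1) // self.drives_per_controller)
        if needed > self.max_controllers:
            return False
        self.drives += 1
        self.controllers = max(self.controllers, needed)
        return True
```
Growing a full string of eight drives adds the ninth drive and, as a side effect, a second controller. Once the controller limit is reached the routine returns `False` and leaves the configuration unchanged, which is how the Boolean result signals a fundamental limitation.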
General optimum-seeking techniques cannot account for the fact that some resources, such as disk drives, grow by discontinuous jumps (whole units, in this case). Such 
The cost model
This model produces configurations which, although vastly different in underlying architecture, have the same performance when measured by average response time. Thus, with performance held fixed, cost becomes the obvious way to compare architectures.
The cost of a subsystem or assembly consisting primarily of integrated circuits can be thought of as the sum of three costs: that of the chips (proportional to the number of chips in the system), of packaging (proportional to the number of pins), and of power and cooling (proportional to the power consumption in watts).
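The three-term sum can be written directly as a small Python function. The function name, parameter names, and the rates in the usage line are hypothetical placeholders, not values from the study.
```python
def subsystem_cost(chips: int, pins: int, watts: float,
                   cost_per_chip: float, cost_per_pin: float,
                   cost_per_watt: float) -> float:
    """Sum of chip, packaging, and power-and-cooling costs."""
    chip_cost = chips * cost_per_chip        # proportional to chip count
    packaging_cost = pins * cost_per_pin     # proportional to pin count
    power_cost = watts * cost_per_watt       # proportional to watts consumed
    return chip_cost + packaging_cost + power_cost


# Hypothetical board: 100 chips, 2000 pins, 50 watts.
print(subsystem_cost(100, 2000, 50.0,
                     cost_per_chip=5.0, cost_per_pin=0.25, cost_per_watt=2.0))
```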
The cost of a particular integrated circuit depends on many factors: die size, yield, process complexity, and "learning curve" phenomena. A simple model to account for these factors is COST
Figure 3 illustrates the maturity learning curve factor, which accounts for phenomena such as process refinements and scaled parts of the same design but with smaller die size and tighter control, all of which reduce costs by increasing yield.
The volume learning curve factor shown in Figure 4 accounts for economies of scale in manufacturing, based on a yield of 10 to 20 percent. The curve reflects the reduced cost of design, testing, and handling on a per-chip basis as production increases.
The final factor in the chip cost equation measures the cost of chips that support the primary type of chip in the system. These chips are often more numerous and less expensive than the primary type of chip. The values for this factor were estimated based on the inspection of many board types and are supported by industry experience.
Many have proposed alternatives for the organization of a database processing system, especially for relational databases.1 These proposals fall into four major classes:
(1) A typical von Neumann architecture (Figure 6) stores the database in "pages" on a secondary storage device. The system loads pages of information into the main memory, and the data is operated on by either a single processor or a shared-memory multiprocessor of conventional design. The von Neumann architecture remains an interesting candidate for future systems because a database management system has many functions besides computing relational database operators. The merits or demerits of using a general-purpose device for all of these functions should not be dismissed without analysis.
(2) A variety of proposals use some sort of associative logic with every "loop" of secondary memory. These architectures assume a serial secondary memory, either inertial (disk or CCD) or non-inertial (bubble). The most advanced of these proposals suggest using a small, fast, content-addressable memory as part of the logic per loop (Figure 7), and thus can be deemed "doubly associative." This study examined several variants on this architecture, using logic-per-head disks or an all-electronic analog, with bubble or CCD memory serving as the secondary associative memory level.
(3) The parallel processing architectures proposed in the literature generally distribute work of fairly small "grain size," where each processing element works on either a character or a bit at a time. However, an architecture could be extremely effective with a larger grain size for the distribution of work. Consequently, one of the series of architectures looks at distributing work with the grain size of a record or a block of records (Figure 8).
system among all users. Thus, all communication between users' databases occurs within the centralized configuration.
In clustered architectures, a small number of users share a database processing engine. Therefore, some interdatabase communication can occur within the cluster. However, interdatabase communications involving databases outside the cluster must travel between clusters. Thus, each clustered architecture contains an intercluster network modeled after Ethernet.10
Smart terminal architectures presume that a processing engine is incorporated into each user's terminal and that any communication between databases requires some kind of network. Each database engine must have sufficient storage capacity (possibly in several hierarchical levels) to contain an entire database version.
Depending on the cost of the memory involved, the smart terminal may hold both the "active" version currently being processed, as well as any inactive versions the user may have. If the memory is fairly expensive, only the active version will reside in the smart terminal; an archival memory will store the inactive versions, which are loaded into the terminal over the network on demand.
The same load profile drove all 30 architectures examined here. The profile represents a mix of functions that does not necessarily correspond to that of any particular user organization. Instead, it exercises several kinds of functions that any database processor would be expected to perform, ranging from trivial requests to extremely complex operations. Neches7 gives a complete description of the transactions composing the load mix, as well as a detailed discussion of how each transaction was modeled for each architecture. The box at left lists all of the trial architectures considered.
Results
Ideally, one architecture would be superior (that is, more economical) for every conceivable combination of parameters. In actual practice, however, different architectures are optimal for different regions in the multidimensional space of organizational requirements. Under a weaker definition, the best architecture is optimum (or closer to optimum) for more of the space of user requirements than any other architecture.
User-shared variants. This study found that the relationship between the three user-shared variants (central, cluster, and smart terminal) remained the same, despite major differences in the basic underlying architecture. For a relatively large number of active users, the centralized approach resulted in the least cost. The clustered approach proved better than the smart terminal approach, even for very small user populations. In general, the more demanding the load requirements, the smaller the number of users at which the centralized approach became the least costly (Figure 10).
The major disadvantage of the smart terminal approach stems from its relatively low utilization of computational and storage resources. Since all cases were defined to have an average response time of at most 15 seconds, and since an arrival rate of one request per user every 60 seconds was assumed, the highest utilization possible was 25 percent. For the foreseeable future at least, the heavy resource demands of advanced data management systems will thus be most economically met by shared configurations.
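The 25 percent figure follows from simple arithmetic, shown here as a short Python check. The function name is hypothetical, and the reasoning assumes one reading of the text: an engine dedicated to a single user can be busy for at most one response time out of every interarrival interval.
```python
def max_utilization(response_time_s: float, interarrival_s: float) -> float:
    """Upper bound on the busy fraction of an engine serving one user."""
    return min(1.0, response_time_s / interarrival_s)


# 15-second response limit, one request per user every 60 seconds.
print(max_utilization(15.0, 60.0))  # 0.25
```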
Memory technology. Transmitting pages between the levels of a storage hierarchy constitutes a major share of response time.11 In paged architectures, it usually dominates response time. Therefore, it is necessary to quantify the impact of alternative memory technologies on the performance and cost of advanced data management systems.
Bubble and CCD memories cost more per bit than disk or EBAM. In addition, the cost disadvantages of bubble and CCD technologies increase with larger relation sizes ( Figure 11 ). Since EBAM development has attracted relatively little funding, it appears that the disk is and will remain the memory technology of choice for advanced data management systems.
One can argue that this picture will not change appreciably in the future, since disk, bubble, and CCD technologies are driven by the state of the art in photolithographic techniques: bubble and CCD for device fabrication, and disk for head fabrication. Thus, with a similar scaling future, it seems unlikely that the relative costs of these technologies will change by a factor of two, much less an order of magnitude.
Do three-level storage hierarchies make sense for advanced information systems, perhaps using CCD or bubble memory as the middle level? Unfortunately, this case degenerates into paging between the bottom (slowest) and middle levels of the hierarchy. Although the pattern of reference to the pages of a given relation can be optimized, the pattern of reference to relations, and hence to pages on the lowest level of the storage hierarchy, cannot be predicted.11 For example, in a multiuser environment the pattern of reference to pages in the second-level store will be disrupted by task commutation.
Architectures with disk. Given the conclusions discussed earlier, the analysis narrows down from the 30 original candidate architectures to just three: paging disk, logic-per-head disk, and distributed function disk, all in their centralized variants.
The paging architecture produces the lowest costs, but it cannot deliver adequate response time when relation sizes exceed 5000 tuples. The disk itself forms the first bottleneck. The results for the CCD and EBAM paging architectures show that without changing the processor architecture, it is possible (though costly) to get a reasonable response time for up to 100,000 tuples from a paging architecture.
Another view of the same results holds that the effective transfer rate of the disk, accounting for latency and system software, is quite low. The actual rate at which pages can be transferred from a disk utilizes only a tiny fraction of the bandwidth suggested by the instantaneous transfer rate of the device. Increasing the instantaneous transfer rate of the disk, either by increasing the bit-packing density or by transferring in parallel from several heads, does not significantly improve matters, because latency effects dominate.
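A rough Python calculation illustrates why latency dominates. The model here is an assumption made for illustration: each page transfer pays one fixed latency (seek, rotation, and software overhead) plus the time to move the page at the instantaneous rate. The function name and all numbers are hypothetical and are not taken from the study.
```python
def effective_transfer_rate(page_bytes: float, latency_s: float,
                            instantaneous_rate: float) -> float:
    """Bytes per second actually delivered when every page pays a latency."""
    transfer_time_s = page_bytes / instantaneous_rate
    return page_bytes / (latency_s + transfer_time_s)


# Hypothetical device: 1024-byte pages, 30 ms latency, 1 Mbyte/s peak rate.
base = effective_transfer_rate(1024, 0.030, 1_000_000.0)
# Same device with the instantaneous rate doubled.
fast = effective_transfer_rate(1024, 0.030, 2_000_000.0)

print(f"effective rate: {base:.0f} bytes/s")
print(f"fraction of peak rate used: {base / 1_000_000.0:.1%}")
print(f"gain from doubling the peak rate: {fast / base - 1.0:.1%}")
```
With these made-up figures the effective rate is only a few percent of the peak rate, and doubling the peak rate improves it by under two percent, because the fixed latency term is far larger than the transfer term.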
Both the logic-per-head and distributed function architectures increase the effective transfer rate from a disk storage facility to a processing facility. In a sense, they implement interleaving schemes for the disk. The logic-per-head architecture moves some of the processing logic into the drive electronics and thus achieves an intimate coupling of processing and storage. The effective transfer rate between the processing elements and the storage elements can
With the number of users held at 100, both architectures were subjected to loads in which the cardinality of the larger relation was varied from 500 tuples to 500,000 tuples, while the ratio c(R1)/c(R2) was varied from one percent to 100 percent. Figure 12 presents a summary of these results.
The logic-per-head architecture performs best when the cardinality of the smaller relation is very small compared to the cardinality of the larger relation. Such architectures effectively match a small number of patterns against a large number of candidates. However, with a large number of patterns to match, the distributed function architecture is more economical. Figure 13 shows the problem domains in which the distributed function, logic-per-head, and paging architectures produce the most economical results. It is worth noting that even where the distributed function architecture was not the lowest-cost solution, it resulted in costs never more than 20 percent higher than the optimum.
Implications of distributed function. The interconnect structure forms the crucial element of any distributed function architecture, because it represents the ultimate bottleneck. The modeled properties of the distributed function architecture suggest several requirements for such a structure.
The interconnect must have a broadcast mode of operation so that a message requesting that some semantic operator be invoked will reach all processors simultaneously. While an ohmic wire system such as Ethernet can do this, it cannot handle the problem of acknowledgment, that is, of ensuring that the message has been correctly received by all processors.
Because an interconnect physically connects a large number of processors, the distance covered by the interconnect will be large, certainly larger than several cabinets.
Since it is the bottleneck, the interconnect structure must also be
This last requirement suggests that the interconnect structure could be a tournament sort binary tree, with sorting elements in the leaf-to-apex direction and broadcast elements in the apex-to-leaf direction. With the 150-ns cycle period assumed in the model, nodes of the tournament sort tree could readily be separated by up to 10 meters of wire. Thus it would be possible to configure a physically large system, albeit in one rather large room.
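The leaf-to-apex half of a tournament sort tree can be imitated in software as a binary tree of two-way merges. The Python sketch below is only a functional analogy under stated assumptions: the names `sorting_element` and `tournament_tree` are hypothetical, each leaf is assumed to emit an already sorted stream, and neither the apex-to-leaf broadcast path nor any timing behavior is modeled.
```python
from typing import Iterable, Iterator, List

_DONE = object()  # sentinel marking an exhausted input stream


def sorting_element(left: Iterable, right: Iterable) -> Iterator:
    """One tree node: pass upward the smaller head of its two inputs."""
    left_it, right_it = iter(left), iter(right)
    a, b = next(left_it, _DONE), next(right_it, _DONE)
    while a is not _DONE and b is not _DONE:
        if b < a:
            yield b
            b = next(right_it, _DONE)
        else:
            yield a
            a = next(left_it, _DONE)
    while a is not _DONE:   # drain whichever input remains
        yield a
        a = next(left_it, _DONE)
    while b is not _DONE:
        yield b
        b = next(right_it, _DONE)


def tournament_tree(leaves: List[Iterable]) -> Iterator:
    """Combine sorted leaf streams pairwise, level by level, up to the apex."""
    level = [iter(stream) for stream in leaves]
    if not level:
        return iter(())
    while len(level) > 1:
        paired = [sorting_element(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd stream passes up unchanged
            paired.append(level[-1])
        level = paired
    return level[0]


# Four hypothetical processors, each producing a sorted run.
print(list(tournament_tree([[1, 4, 9], [2, 3], [0, 8], [5]])))
```
The stream that emerges at the apex is the sorted merge of all leaf streams, and the number of tree levels grows only with the logarithm of the number of processors.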
The implementation of such an interconnect represents an interesting challenge. Given the effort involved in implementing such an interconnect, would it be relevant to other computing problems besides data management?
The distributed function architecture assumes that the data management problem can be divided into subproblems with a fairly high locality of reference. However 
