Abstract
Introduction
Typical I/O devices consist of the physical device hardware (e.g., disk platters, read/write heads), device specific electronics (e.g., sense amplifiers) and generic electronics (a general purpose or special purpose embedded microprocessor or processors). With the rapid growth in processing power per processor (estimated at a rate of 60% per year [11]), it is reasonable to consider implementing and treating the processing power placed in a disk controller as general purpose, and not just as a dedicated microprogrammed embedded controller. For instance, a 33 MHz ARM7TDMI embedded processor has recently been used to implement all the functions of a disk controller, including the servo control [3]. In addition, many of today's storage adapters as well as outboard and Network Attached Storage (NAS) controllers already contain several general purpose commodity processors to handle functions such as RAID [6] protection and network protocol processing. If a moderately powerful general purpose microprocessor is combined with a reasonable amount of local memory, and placed either in a disk controller or a storage controller (i.e., a controller which controls multiple devices), then there will exist a general purpose outboard CPU with substantial excess processing capacity.
0-7695-0568-6/00 $10.00 © 2000 IEEE
Recent advances in software technology make using this processing capacity easier than previously. In particular, software fault isolation techniques [26] as well as robust and secure languages such as Java [9] enable applications to be effectively isolated so that they can be safely executed on a machine without causing malicious side effects. Recent emphasis on architectural neutrality and the portability of languages [9] further enhances code mobility and eases the way for code to be moved to different machines for execution. For example, in Sun's Jini framework [22], application code can be downloaded to the device as needed. The convergence of these hardware and software developments provides an opportunity for a fundamental shift in system design by allowing application code to be offloaded to the peripherals for execution.
In this paper, we propose a general architecture for Smart Storage in which a processing unit that is coupled to one or more disks can be used to perform general purpose processing offloaded from the host. Essentially, we envision a system in which the host supervises a number of SmartSTORs, each of which consists of a powerful processing unit, a useful amount of local memory, and a number of I/O devices, usually disks. The host processor may generate tasks specific to one SmartSTOR (i.e., only needing data local to that SmartSTOR) and delegate that work to the SmartSTOR, which would then deliver the result to the host. Alternatively, the SmartSTOR can be handed more complicated tasks that require coordination with other SmartSTORs. If the generation and delegation of these tasks can be sufficiently automated and reliable, and if the load balancing is successful, then the processing power of the SmartSTOR CPUs and the host becomes additive, and the result is a much more powerful system.
Besides allowing processing to be offloaded from the host processor, the Smart Storage architecture also reduces data movement between the host and storage subsystem. In addition, it allows processing power to be automatically scaled with increasing storage demand. An important feature of a SmartSTOR is that it may be configured as NAS, in which case the processing power in the SmartSTOR would be available to any system mounting that storage. Other advantages of embedding intelligence in storage include simplifying the costly task of system management [5] .
An essential element to the success of the Smart Storage architecture lies in convincing software developers that SmartSTOR is a viable and attractive architecture. Projecting the performance potential of the SmartSTOR architecture relative to the required software effort is an important first step in this direction. Since decision support workloads are increasingly important commercially [4] , a major part of this paper is devoted to understanding how these workloads will perform on the SmartSTOR architecture. In particular, we evaluate the performance of the Transaction Processing Performance Council Benchmark D (TPC-D) [23] , which is the industry-standard decision support benchmark, on various SmartSTOR-based systems.
Our methodology is based on projecting SmartSTOR performance from current system performance and parameters. More specifically, we use the system configurations of recent TPC-D results to determine the number of SmartSTORs that will be needed. In addition, we examine the query execution plans from two recently certified TPC-D systems to establish the fraction of work that can be offloaded to the SmartSTORs. We also use recent TPC-D results to empirically derive the system scalability relationship so that we can estimate the effectiveness of distributing a query among many SmartSTORs. There are clearly limits to this projection approach but we believe that it is the most effective and appropriate methodology at this early stage.
The rest of this paper is organized as follows. In the next section, we discuss related work and highlight some of the unique features of the SmartSTOR architecture that make it more viable than other previous proposals. In Section 3, we describe the hardware and software architecture for SmartSTOR. This is followed in Section 4 by a discussion of the methodology used to project the performance of TPC-D on systems with SmartSTOR. Performance analysis results are presented in Section 5. Section 6 concludes this paper.
Related Work
There have been some recent proposals for embedding intelligence in disks [10] and these include the Intelligent Disk (IDISK) [18] and the Active Disk [1, 20]. The processors that can be used in these disk-centric proposals are subject to the power budget and stringent cost constraints of the disk environment; disks are generally fungible and are sold almost entirely on the basis of price. The market for high cost/high performance/high functionality disks is very limited, and thus prices for disks in this market segment are higher than they would otherwise be due to the loss of efficiencies of scale.
On the other hand, SmartSTOR, by operating at the level of the storage (i.e., multiple device) controller, can offer processing units that are more substantial and therefore easier to use effectively. Moreover, by allowing a processing unit to be coupled to one or more disks, the SmartSTOR architecture allows for more flexible scaling of processing power to increasing storage demand. In the nearer term, the SmartSTOR architecture is likely to be easier to accomplish because increasing the processing power on an adapter or controller to handle general purpose processing is less risky than modifying the actual disk design. It also lowers the barrier of entry and opens up the architecture to the creativity of more than just the few disk companies. Finally, it separates the manufacturing of low cost disks (most of which go into PCs) from high performance controllers (which can go into servers, clusters and mainframes, and which are relatively price-insensitive).
The idea of moving processing closer to the disk was studied extensively in the form of database machines during the late 1970s and early 1980s [8, 15]. Most of those database machines relied on costly special-purpose hardware which had to be specifically programmed and which prevented the database machines from taking advantage of algorithmic advancements and improvements in commodity hardware. In most cases, the functionality of the database machines was limited; they could not do arbitrary database operations. In addition, the reliance on highly-specialized hardware made it difficult to develop succeeding generations of the system so that it was not worthwhile to expend significant effort programming these machines.
In contrast, the SmartSTOR architecture leverages com- 
The SmartSTOR Architecture
The proposed Smart Storage architecture consists of a processing unit that is coupled to one or more disks. Figure 1 depicts such an architecture. We define the cardinality of a SmartSTOR to be the number of disks it contains. A SmartSTOR with a cardinality of one contains a single disk and is referred to as SD. In our performance projection, SD is conceptually equivalent to an IDISK/Active Disk. When a SmartSTOR contains multiple disks, we refer to it as MD.
A SmartSTOR-based system is similar in many aspects to a cluster of general-purpose nodes made up of commodity parts but there are several important differences. In general, SmartSTORs can be built by increasing the processing power of existing storage adapters or controllers, many of which are based on commodity components. Such an approach allows us to save on the supporting infrastructure (e.g., chipset, power distribution, physical packaging). Another approach to building SmartSTORs is to perform a limited amount of custom packaging and "decontenting", i.e., removing parts that are not needed or will not be noticed, on a standard PC design. For example, we could easily put together a package consisting of an X86 processor, a power supply, a network card and several disks to serve as the hardware for a SmartSTOR. This allows us to leverage the most cost effective parts and to achieve more efficient packaging and reduced component count. Besides saving on the upfront equipment cost, SmartSTOR also has the potential to reduce operating costs through more efficient packaging which takes up less floor space. In addition, SmartSTORs can be made easier to manage than general purpose PCs [5], especially since they are designed to handle specific tasks that are offloaded from the host through a well-defined interface.
Ultimately, the success of the SmartSTOR architecture hinges on the availability of software that can take advantage of its unique capabilities. Figure 2 shows a spectrum of software options, each having different performance potential and requiring different amounts of software engineering effort. At this point in time, it is not apparent which software architecture, if any, will provide enough benefits to justify its development cost but through the performance projection that we will perform later in this paper, we hope to gain some understanding that will help developers reach their own conclusions.
Intuitively, data intensive operations like filtering and aggregation should be offloaded to the SmartSTOR. More generally, operations that rely solely on local data belonging to a single base relation are good candidates for offloading. We refer to this single-relation offloading as SR. Such operations are the basis of database queries and include table/index scan, sort, group by and partial aggregate functions. Basically, SR includes all single-relation operations before a join or a "table queue", which is a mechanism through which the database management system (DBMS) distributes data among its agents.
Although single-relation operations are the basis of database queries, a typical decision support query involves a lot more than just these basic operations. To distribute more processing to the SmartSTORs, we have to consider offloading multiple-relation operations such as joins that may involve data in one or more SmartSTORs. Such multiple-relation offloading is referred to as MR. At the extreme end, this is functionally equivalent to running a complete shared-nothing DBMS [7, 19] such as IBM's DB2/EEE [17] and an operating system on each SmartSTOR. However, such an approach has hefty resource requirements. In this case, using SmartSTORs with a more substantial processing unit shared among multiple disks is likely to be more effective than an IDISK/Active Disk setup. It may be possible to trim the DBMS to contain only the functionality profitable for offloading and to use less memory-intensive algorithms but coming up with this and other software architectures is an open research problem.
Projection Methodology
In this section, we outline the methodology that we use to assess the effectiveness of the SmartSTOR architecture and the relative merits of the various hardware and software options. Our analysis is based on the TPC-D benchmark [23], which is the industry standard benchmark for decision support. Readers who are interested in the characteristics of the benchmark are referred to [12, 13], which contain a comprehensive analysis of the benchmark characteristics and how they compare with those of real production database workloads. This paper is based on TPC-D version 1 since it has the largest number of published results. As soon as enough TPC-H [24] results are available, we plan to do a followup study to see whether the same trends are observed with the new benchmark. Because both hardware and software technologies are advancing rapidly, we decided to look at the more recent results, specifically those that were published between July 1998 and January 1999. We omit the very recent results because we believe that these setups have been so fine-tuned for running the benchmark that attempting to lump them in with the other results would be meaningless. In particular, the aggressive use of Automatic Summary Tables (ASTs), i.e., auxiliary tables that contain partially aggregated data, enables processing to be effectively pushed to the database load phase, which is not part of the TPC-D performance metric, so that very little processing needs to be performed when executing the queries. Nevertheless, if we have enough such results, it might be interesting to apply the same analysis to them as a separate group.
I/O Bandwidth
There are two likely major advantages to the SmartSTOR architecture: (a) the amount of data that needs to be moved from the disks to the host for processing should be significantly reduced; (b) the actual processing can be offloaded from the host and done in parallel by the many processors within the whole system. Since decision support workloads are very data intensive, it is generally believed that they will benefit substantially from the potential decrease in I/O traffic.
Based on measurements performed on several certified TPC-D setups, we have been able to establish a simple rule of thumb relating the TPC-D scale factor to the amount of physical I/Os required. More specifically, we find that for a database of scale S, a total of about 3 × S GB of data are transferred between the host and storage system during a TPC-D power test. With improvements in the memory capacity of the host system and more sophisticated database optimization, the constant 3 is expected to gradually decrease over time. Our measurements also indicate that the peak bandwidth requirement is about 3.3 times the average. Note that these rules of thumb are based on measurements conducted without the use of ASTs. With ASTs, the I/O bandwidth required will be lower.
In Table 1, we apply these rules of thumb to estimate the I/O bandwidth consumed in some recent TPC-D benchmark runs. The highest per node I/O bandwidth consumption (1251.50 MB/s peak) is observed on a 32-processor system that has a 12.5 GB/s system bus and can be configured with 32 PCI buses, each having a peak bandwidth of 528 MB/s. This puts the peak bandwidth consumed at about 10% of the bandwidth available. The highest per processor I/O bandwidth consumption is about 48.43 MB/s peak and occurs on an 8-processor system with a 3.2 GB/s system bus. This system can be configured with eight 528 MB/s PCI buses. Such results suggest that decision support workloads similar to TPC-D may not impose extra I/O bandwidth burden over that required for other workloads that today's systems are designed to handle.
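The two rules of thumb can be turned into a quick estimate. The sketch below is illustrative only: the function name and the elapsed-time parameter are assumptions (the rule of thumb gives a data volume, so a power test duration is needed to obtain a rate), the example figures are hypothetical rather than taken from Table 1, and 1 GB is taken to be 1000 MB.

```python
def estimate_io_bandwidth(scale_factor, power_test_seconds,
                          gb_per_scale_unit=3.0, peak_to_average=3.3):
    """Estimate average and peak host-storage bandwidth in MB/s.

    Applies the two rules of thumb from the text: about 3 * S GB are
    transferred during a power test at scale factor S, and the peak
    bandwidth is about 3.3 times the average.
    """
    total_mb = gb_per_scale_unit * scale_factor * 1000.0  # assumes 1 GB = 1000 MB
    average = total_mb / power_test_seconds
    peak = peak_to_average * average
    return average, peak


# Hypothetical example: a 300 GB database whose power test takes 2 hours.
avg, peak = estimate_io_bandwidth(300, 2 * 3600)
print(f"average {avg:.1f} MB/s, peak {peak:.1f} MB/s")
```

Dividing the estimated peak by the bandwidth of the system bus or of the PCI buses gives the kind of utilization figure quoted above.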
To further understand this rather surprising finding, we examined the query execution plans from a recently certified TPC-D setup. The queries rely on an index in one way or another. In this particular TPC-D setup, a total of twenty-six indices are defined over the eight relations. Perhaps as a reflection of the fact that the TPC-D benchmark has been well studied and understood, there are many cases of index-only-access in which all the required fields are defined in the indices. It appears that the judicious use of techniques such as indices has been extremely effective at reducing the amount of I/O bandwidth required to support a TPC-D-like decision support workload. Therefore, for the rest of this paper, we will concentrate on the offloading aspect of SmartSTOR.
System Configuration
The first step in projecting the performance of TPC-D on the SmartSTOR architecture is to determine the number of SmartSTORs that will be in the system and the processing power that they will possess. As is typical of forward-looking studies, we assume that some aspect of the system, in this case the number of drives, will remain the same. Table 2 summarizes the relevant configuration information for the recent TPC-D results. Recall that cardinality is the number of disks per SmartSTOR. For each setup, we project the number of SmartSTORs in the corresponding future system by:
\[ \text{num-SmartSTOR} = \frac{\text{num-disk}}{\text{cardinality}} \]
In order to describe the processing power available in the SmartSTORs without using absolute and therefore time frame dependent numbers, we introduce the notion of performance per disk (perf-per-disk), which is the effective processing power per disk relative to the host processor:
\[ \text{perf-per-disk} = \frac{\text{processing power per SmartSTOR}}{\text{processing power of host processor} \cdot \text{cardinality}} \]
The actual value of perf-per-disk depends on the cardinality, family and generation of processors used, the power budget, the system design, etc. and is open to debate. In general, we believe that if the processor is embedded in the disk as opposed to the adapter or outboard controller, it will tend to have lower performance because of the smaller power budget and the much more stringent cost constraints in the disk environment. For an intelligent adapter or controller, the embedded processor may perhaps be even as powerful as a host processor, although that is unlikely to be cost effective. In either case, the embedded processor is likely to be used also for tasks, some of which are real-time, that are previously performed in special-purpose hardware. Since it is premature to specify precise values for perf-per-disk, we perform sensitivity analysis on the parameter later in this paper.
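As a small worked example of the two configuration parameters, the sketch below computes num-SmartSTOR and perf-per-disk for a made-up setup. The function names and the numbers are assumptions for illustration; processing power is expressed relative to one host processor.

```python
def num_smartstor(num_disk, cardinality):
    """Number of SmartSTORs when the number of drives is kept unchanged."""
    return num_disk / cardinality


def perf_per_disk(smartstor_power, host_processor_power, cardinality):
    """Effective processing power per disk, relative to one host processor."""
    return smartstor_power / (host_processor_power * cardinality)


# Hypothetical setup: 128 disks, 4 disks per SmartSTOR, and an embedded
# processor half as powerful as a host processor.
n = num_smartstor(128, 4)
ppd = perf_per_disk(0.5, 1.0, 4)
print(n, ppd)
```

In this example each of the 32 SmartSTORs contributes one eighth of a host processor per disk, so sharing one processor among more disks lowers perf-per-disk unless the processor becomes proportionally more powerful.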
SR Performance
Recent work has shown that single-relation operations such as SQL select and aggregation can be very effectively offloaded to an Active Disk [1]. However, a typical decision support query involves a lot more than just single-relation operations. In most cases, the results of the single-relation operations are combined through joins to create new derived relations that are further operated on. Therefore, determining the actual fraction of work that can be offloaded by SR, and thereby the potential overall speedup, is complex.
Our method for determining the fraction of work that can be offloaded by SR is to analyze the query execution plans. Measuring the CPU time needed for each individual operation in a query execution plan is extremely difficult because the operations are executed simultaneously in parallel or in a pipelined fashion. Therefore, we use the CPU costs estimated by the query optimizer to determine the fraction of work that can be offloaded by SR.
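As an illustration of this accounting, the sketch below walks a query plan tree and sums the optimizer's CPU cost estimates over the subtrees that SR could offload, i.e., those made up entirely of single-relation operators below any join or table queue. The `PlanNode` structure, the operator labels and the cost figures are hypothetical and chosen for the example; they are not taken from DB2 or from the plans analyzed here. The sketch also assumes that each node carries the cost of its own operator only.

```python
from dataclasses import dataclass, field

# Operators treated as single-relation work (hypothetical labels).
SR_OPS = {"table_scan", "index_scan", "sort", "group_by", "partial_aggregate"}


@dataclass
class PlanNode:
    op: str                      # operator label, e.g. "index_scan" or "join"
    cpu_cost: float              # optimizer's CPU cost estimate for this operator alone
    children: list = field(default_factory=list)


def is_sr_subtree(node):
    """True if the node and everything below it are single-relation operators."""
    return node.op in SR_OPS and all(is_sr_subtree(c) for c in node.children)


def total_cost(node):
    """CPU cost of the node plus all operators below it."""
    return node.cpu_cost + sum(total_cost(c) for c in node.children)


def offloadable_cost(node):
    """CPU cost of the maximal subtrees that contain only SR operators."""
    if is_sr_subtree(node):
        return total_cost(node)
    return sum(offloadable_cost(c) for c in node.children)


def sr_fraction(root):
    """Fraction of the plan's estimated CPU cost that SR could offload."""
    return offloadable_cost(root) / total_cost(root)


# Hypothetical plan: two scans (one followed by a sort) feed a join,
# whose result is then grouped.
plan = PlanNode("group_by", 10.0, [
    PlanNode("join", 50.0, [
        PlanNode("sort", 5.0, [PlanNode("table_scan", 20.0)]),
        PlanNode("index_scan", 15.0),
    ]),
])
print(sr_fraction(plan))
```

Note that the group-by above the join is not counted even though group-by is a single-relation operator, because it operates on a derived relation rather than on a base relation.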
The results presented in this paper are based on the query execution plans from two recently certified TPC-D setups. In order to understand the possible range of values, we consider both a shared-everything and a shared-nothing DBMS. The first system we consider is a Symmetric Multiprocessor System (SMP) running IBM's DB2/UDB [16], a shared-everything DBMS, while the second system is a cluster-based setup running the shared-nothing IBM DB2/EEE [17]. The complete set of plans from the first system is available in [14]. Table 3 summarizes the fraction of processing that can be offloaded by SR for the two TPC-D setups. From the table, Query 1 is the only query that can be offloaded by more than 50% in System 1. Observe further that only 5 out of the 17 queries can be offloaded by more than 10% in System 1. System 2 is generally more amenable towards single-relation offloading but it is still the case that less than half of the queries can be offloaded by more than 10%. According to Amdahl's Law [11], these statistics suggest that the performance potential of SR may be limited. However, the fact that there is a substantial difference between the figures for the two setups suggests that there may be considerable room for improving the plans generated to better take advantage of the SmartSTOR architecture. This is an area that requires further research.
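Suppose that f is the fraction of processing that can be offloaded by SR. Assuming that host and SmartSTOR processing are maximally overlapped, the speedup that can be achieved by SR is:
\[ \text{speedup} = \frac{1}{\max\left(1 - f,\; \dfrac{f}{p}\right)}, \qquad p = \frac{\text{num-SmartSTOR} \cdot \text{cardinality} \cdot \text{perf-per-disk}}{\text{num-host-proc}} \]
where p is the aggregate processing power available in the SmartSTORs relative to that in the host. If we further assume that the system will be intelligent enough to offload operations only when it makes sense to do so, the speedup is:
\[ \text{speedup} = \max\left(1,\; \frac{1}{\max\left(1 - f,\; f/p\right)}\right) \]
As we shall see, even with such optimistic assumptions, the performance potential of SR is rather limited.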
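The speedup model just described can be sketched in a few lines. The sketch reads the maximal-overlap assumption as follows: with the original run time normalized to 1, the host spends 1 - f on the work that stays behind while the SmartSTORs spend f/p on the offloaded work, and the run time is the longer of the two. The function names, the symbol p (aggregate SmartSTOR power relative to the host, assumed positive) and the example values are assumptions made for illustration.

```python
def relative_smartstor_power(num_smartstor, cardinality, perf_per_disk, num_host_proc):
    """Aggregate SmartSTOR processing power relative to that of the host."""
    return num_smartstor * cardinality * perf_per_disk / num_host_proc


def sr_speedup(f, p, offload_only_if_faster=True):
    """Speedup when a fraction f of the work runs on SmartSTORs of relative power p.

    Assumes host and SmartSTOR processing overlap completely, so the
    normalized run time is the longer of the two parts.
    """
    run_time = max(1.0 - f, f / p)
    speedup = 1.0 / run_time
    # An intelligent system would decline to offload when that slows the query.
    return max(1.0, speedup) if offload_only_if_faster else speedup


# Hypothetical setup: 32 SmartSTORs of 4 disks each, perf-per-disk 0.125,
# and 8 host processors, so the SmartSTORs together match twice the host.
p = relative_smartstor_power(32, 4, 0.125, 8)
for f in (0.1, 0.5, 0.9):
    print(f, sr_speedup(f, p))
```

Under this reading the speedup can never exceed 1/(1 - f), whatever the value of p, which is the Amdahl's Law limit: a query that is only 10% offloadable gains at most about 11%.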
Assuming that the current run time for query i is QI(i), we can project the run time for the query on a SmartSTOR architecture, QI(i)', by:
\[ QI(i)' = \frac{QI(i)}{\text{speedup}(i)} \]
where speedup(i) is the projected speedup for query i.
The TPC-D benchmark defines both a power metric and a throughput metric [23]. Since we are primarily interested in speedups, we focus on the power metric, QppD, in this paper. In determining the average performance improvement possible in a SmartSTOR architecture with SR, we use the projected query run times, QI(i)', to determine the speedup in QppD for each of the recent 19 TPC-D systems. Then we take the arithmetic mean over the 19 setups to obtain an average improvement in QppD. Note that QppD includes the execution times of two update functions, which we assume cannot be offloaded by SR. Also, as discussed in [14, 23], the definition of QppD limits the run time of any query to be at most 1000 times shorter than that of the slowest query.
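To make the step from per-query speedups to a QppD improvement concrete, the sketch below assumes, following our understanding of the TPC-D specification [23], that QppD is inversely proportional to the geometric mean of the query and update function run times, so that constant factors cancel in the ratio. The function names and the example run times are hypothetical, and the 1000-to-1 limit is applied as a floor on the shortest intervals.

```python
import math


def clamp_intervals(times, max_ratio=1000.0):
    """Raise any interval shorter than (longest / max_ratio) to that floor."""
    floor = max(times) / max_ratio
    return [max(t, floor) for t in times]


def geometric_mean(times):
    """Geometric mean of a list of positive run times."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


def qppd_speedup(query_times, update_times, query_speedups):
    """Ratio of projected QppD to current QppD.

    query_times    : current run times QI(i)
    update_times   : run times of the update functions (assumed not offloaded)
    query_speedups : projected speedup for each query
    """
    projected = [t / s for t, s in zip(query_times, query_speedups)]
    before = clamp_intervals(list(query_times) + list(update_times))
    after = clamp_intervals(projected + list(update_times))
    return geometric_mean(before) / geometric_mean(after)


# Hypothetical example with two queries and one update function; only the
# first query benefits from offloading.
print(qppd_speedup([100.0, 400.0], [50.0], [2.0, 1.0]))
```

Because the metric is a geometric mean over all timed intervals, doubling the speed of one query out of many moves QppD only slightly, and intervals that cannot be offloaded dilute the gain further.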
MR Performance
In general, when work is distributed across multiple processing units, skew comes into play so that the performance of the system scales sublinearly with the number of processing units. For a well-understood workload such as TPC-D, we can try to distribute the tuples in the base relations evenly across the SmartSTORs so as to minimize any data skew. Therefore, for SR, the portion of work offloaded is likely to be sped up by the extra processing power available in the SmartSTORs. However, for more complicated operations that involve redistributing tuples or that involve derived relations, there is likely to be an unequal distribution of relevant tuples across the SmartSTORs.
In order to project the performance of TPC-D when multiple-relation operations are offloaded, we need to understand how effectively the work can be distributed across the SmartSTORs, i.e., we need to understand the scalability of the system. Since we are not aware of any generally accepted model of scalability for TPC-D, we empirically derive a model by using the recent TPC-D results. Because these results were obtained on systems with different processors, we have to first normalize them. SPEC measurements [21] are more indicative of performance on CPU intensive integer and floating point workloads than of the CPU portion of database system and operating system workloads, but we are not aware of any better alternative; we therefore normalize TPC-D performance by SPEC numbers [21] and refer to the normalized per processor performance as the database efficiency. We believe that the errors introduced should be small.
Figure 3 plots the database efficiency of the recent 300 GB TPC-D results. We choose to use the 300 GB results because the benchmark setups for this scale factor have a wide range in the number of processors used. Observe that the set of points can be roughly approximated by
\[ \text{database efficiency} = \frac{C}{\sqrt[3]{\text{num-host-proc}}} \]
where C is a constant. We refer to this scalability rule as the cube root rule in that when the number of processors is increased by a factor of eight, the per processor efficiency is halved. We expect the scalability of the system to improve with advances in both hardware and software. Therefore, we use the fourth root rule to consider future TPC-D system scalability. With the fourth root rule, the per processor efficiency is halved when the number of processors is increased by a factor of 16. Note that real workloads are unlikely to be as well understood and tuned as the TPC-D benchmark and the processing will tend to be less well distributed. In other words, real workloads will probably scale more poorly with the number of processors. Therefore, we also consider the square root rule.
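The three scalability rules differ only in the root that is taken. The short sketch below, with function names of our own choosing and the constant C set to 1, shows the per processor efficiency under a k-th root rule and the aggregate throughput that it implies.

```python
def per_processor_efficiency(num_proc, root=3, c=1.0):
    """Per processor efficiency under the k-th root rule: C / num_proc**(1/k)."""
    return c / num_proc ** (1.0 / root)


def relative_throughput(num_proc, root=3):
    """Aggregate throughput relative to a single processor: n * n**(-1/k)."""
    return num_proc * per_processor_efficiency(num_proc, root)


# Efficiency is halved at 4, 8 and 16 processors under the square, cube and
# fourth root rules respectively.
for root, n in ((2, 4), (3, 8), (4, 16)):
    print(root, n, per_processor_efficiency(n, root), relative_throughput(n, root))
```

Since the aggregate throughput grows as the number of processors raised to the power 1 - 1/k, eight processors deliver only about four times the throughput of one under the cube root rule.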
Using these scalability rules, we can establish a relationship between QppD and the number of processors and their processing power. Therefore, assuming that the system will be intelligent enough to offload operations only when it improves performance, the speedup is:
Using this result, the improvement in QppD can be projected for each of the 19 recent TPC-D systems. As in the case for SR, we take the arithmetic mean over the 19 setups to obtain the average projected improvement in QppD.
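One way to read this projection is sketched below. It is a simplified interpretation rather than the exact formula used for the results: it assumes that all query processing moves to the SmartSTORs, that the host contributes no processing of its own, that both systems obey the same k-th root rule with the same constant, and that offloading is skipped when it would slow the system down. The function name and the example configuration are hypothetical.

```python
def mr_speedup(num_smartstor, cardinality, perf_per_disk, num_host_proc, root=3):
    """Projected speedup with multiple-relation offloading (illustrative model).

    Aggregate throughput of n processors of relative power r is taken to be
    r * n**(1 - 1/k) under the k-th root rule.
    """
    exponent = 1.0 - 1.0 / root
    # Power of one SmartSTOR processor relative to one host processor.
    smartstor_power = perf_per_disk * cardinality
    smartstor_throughput = smartstor_power * num_smartstor ** exponent
    host_throughput = num_host_proc ** exponent
    # Offload only when it improves performance.
    return max(1.0, smartstor_throughput / host_throughput)


# Hypothetical system: 64 disks and 16 host processors, square root rule,
# with SmartSTOR processors as powerful as a host processor.
print(mr_speedup(64, 1, 1.0, 16, root=2))   # one processor per disk
print(mr_speedup(32, 2, 0.5, 16, root=2))   # one processor per two disks
```

Under these assumptions, halving the number of equally powerful processors under the square root rule divides the speedup by the square root of two, which is roughly the ratio between the projected figures for cardinalities 1 and 2 reported in Section 5 (3.05 versus 2.15).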
Analysis of Performance Results
Based on the steps outlined in the previous section, we can analytically derive the improvement in QppD for the various hardware and software options. The results are summarized in Figure 4, where we plot the projected speedup for the various alternatives as a function of the effective processing power per disk. To reduce clutter, we use MRSD, MR2D and MR4D to denote multiple-relation offloading on SmartSTORs of cardinality 1, 2 and 4 respectively. For MR, we plot the projected range of speedup with the square, cube and fourth root scalability rules. The lightly shaded regions in the figure are bounded by the potential speedup with the fourth and cube root rules. The corresponding range of speedup with the cube and square root rules is represented by the textured (bricked) regions. For SR, we plot the range of speedup given by the two sets of offloading fractions discussed in Section 4.3 and presented in Table 3. Note that the figure makes no cost statement. This is deliberate since accurate cost information is generally closely guarded and, in any case, is very technology and time frame dependent. Given a set of cost estimates, Figure 4 can be used to determine whether SmartSTOR is a cost-effective approach and if so, the configuration that should be used.
Recall from our scalability model that for MR, TPC-D performance tends to scale rather sublinearly with the number of processors used. This shows up in Figure 4 in that for the same perf-per-disk, MR4D is projected to have a performance advantage over MR2D and an even bigger advantage over MRSD. Therefore, for a given aggregate processing power, it is significantly more effective to share powerful processors among multiple disks than to have less powerful per-disk processors. This may not be the most cost effective solution, however; current pricing policies are such that prices go up more than linearly with processor speed. Note also that MD is limited by the fact that there are no arbitrarily powerful processors. In this paper, we assume that the embedded processor is at most as powerful as the host processor so that the maximum perf-per-disk is given by 1/cardinality.
A natural question to ponder at this juncture is how IDISK/Active Disk compares with SmartSTOR. IDISK/Active Disk is conceptually identical to an intelligent adapter or controller of cardinality 1 with the exception that it is likely to have a lower perf-per-disk. As discussed earlier, the exact value of perf-per-disk is arguable but with the much more stringent power and cost constraints in the disk environment, the processors that are embedded per disk are likely to be significantly less powerful than those in SmartSTOR. With a perf-per-disk value of 0.25, the projected improvement in QppD ranges from 1.16 to 1.39 for SR and from 1.05 to 1.88 for MR. For comparison, this ratio of processing power is about equivalent to that between a 200 MHz Intel Pentium MMX and a 575 MHz Compaq Alpha 21264 (based on SPECintbase95). As mentioned earlier, a possible approach to building SmartSTORs is to decontent a standard PC. In this case, a rough value of perf-per-disk for a SmartSTOR may be 1/cardinality.
Based on this, the projected speedup in QppD for cardinalities of 1, 2 and 4 with multiple-relation offloading ranges from 3.05 to 6.02, from 2.15 to 3.58 and from 1.52 to 2.13 respectively. For single-relation offloading, the corresponding ranges are 1.20-1.59, 1.17-1.48 and 1.15-1.35.
Notice from Figure 4 that multiple-relation offloading clearly outperforms single-relation offloading for practically all values of perf-per-disk. However, by distributing more processing to the SmartSTORs and harnessing more of the parallelism in the system, MR is likely to require a lot more resources, particularly memory, from the SmartSTORs. As we have discussed earlier, MR in its broadest sense is functionally equivalent to running a complete shared-nothing DBMS and an operating system on each SmartSTOR. An interesting research question is whether it is possible to effectively separate out the functionality profitable for offloading and to use new algorithms that are less memory-intensive. However, as we have alluded to earlier, the performance potential of the SmartSTOR architecture has to be evaluated relative to the cost of any software reengineering, and to the cost of building the SmartSTORs themselves.
Finally, an important point to note is that among all published TPC-D results so far, the largest number of processors used is only 192 while the largest number of disks used is over 1,500. If embedding intelligence in storage is an inevitable architectural trend, we have to focus on improving the scalability of parallel software systems to effectively take advantage of the large number of processors that will be in the system.
Conclusions
In this paper, we have proposed a general Smart Storage (SmartSTOR) architecture in which general purpose processing can be performed by a processing unit that is shared among one or more disks. The SmartSTOR architecture provides two key performance advantages, namely a reduction in I/O movement between the host and I/O subsystem, and the ability to offload some of the work from the host processor to the processing units in the SmartSTORs. In addition, SmartSTORs may be configured as NAS devices, in which case the processing power in the SmartSTOR would be available to any system mounting that storage.
The analysis performed in this paper suggests that I/O bandwidth may not be that serious a bottleneck for TPC-D. Therefore the main advantage of using SmartSTORs for workloads similar to TPC-D appears to be the ability to offload some of the processing from the host. By analyzing recent TPC-D results, we find that the performance of decision support systems scales rather sublinearly with the number of processors used. Therefore, our results indicate that there is a definite advantage in using fewer but more powerful processors. In view of this and the arguments presented in the paper, we believe that intelligent adapters or controllers that share a substantial processing unit among multiple disks may be an interesting architecture.
As for software architecture, our evaluation shows that the offloading of database operations that involve multiple relations is far more promising than the offloading of operations that involve only a single relation. In either case, if embedding intelligence in storage is an inevitable architectural trend, we have to develop parallel software systems that are more scalable to effectively take advantage of the large number of processing units that will be in the system.
