INTRODUCTION
LEXIS-NEXIS uses a diverse collection of computer systems to deliver on-line information services worldwide. An important part of the development of new information services is estimating the system capacity required to support the product. Even for existing products, it is important to estimate capacity requirements for services experiencing growth.
At LEXIS-NEXIS, capacity estimates are typically based on a simulation of the processes involved in preparing and delivering a new service. Usually simulation models describe processes in great detail and include such tasks as computation of specific subprocesses, input/output (I/O) of data, remote procedure calls (RPCs), etc. The data used to describe time delays and resource consumption (CPU, memory, etc.) were based on expert estimates. Most estimates were constants or simple functions of I/O size. These methods were used because they produced reasonable results and no better information was available. In 1994, however, extensive controlled testing was done on both HP and Sun servers to gather precise data on I/O times and resource consumption.
At about the same time, the modeling effort described in this paper was initiated to develop a better method for simulating resource consumption and time delays caused by I/O. The concept was to produce a portable module that could be inserted into any LEXIS-NEXIS performance simulation that required modeling local (as opposed to network) Unix I/O. The module needed to be customizable to the actual physical configuration of the hardware, and to simulate resource consumption and delays with high fidelity but minimal additional computational burden on the simulation. Ultimately, the HP and Sun laboratory tests were used to parametrize and validate this module. The remainder of the paper describes the processes used to develop, validate and apply this improved modeling process.
SYSTEMS MODELED
Both the HP and Sun systems of interest to us had similar overall architectures: multiple CPUs are connected to memory modules via a high-speed backplane (about 240-960 megabytes per second). All I/O data passes through a converter that contains buffering to match the speed and the protocol of the peripheral bus to that of the backplane. SCSI controllers are connected to a peripheral bus with speeds on the order of 30-50 megabytes per second.
The actual I/O devices are then attached to the SCSI devices. The DASD (disk) devices for both systems had very similar architectures with on-board CPUs and buffers, as well as look-ahead capability.
The details of each of the main components differ considerably between the two systems despite the similarity of the overall architectures.
The details of the architecture of the HP9000 Model T500 may be found in the article by Alexander, et al. (1994). The main concerns for our work were not the internals of the system, but rather the rates at which various modules process data and various buses and paths transmit data. The HP9000 backplane operates at a rate of 240-960 MB/s (megabytes per second), and our system had the maximum capability.
Because of the high speed and the use of a packet-switching protocol on the bus to transfer data, we assumed that there would never be any queueing on this device, only transfer delays. Hewlett-Packard (HP) uses a device called a bus converter to provide an interface for the peripherals to the backplane. This device has two 64 MB/s Precision Buses, each of which can have 1-7 SCSI adapters attached to it.
The actual data transfer rate after protocol overhead is about 42 MB/s. In our experience, this data bandwidth is sufficient to present no detectable queueing delays.
The DASD devices are connected to fast, wide SCSI adapters that have a nominal transfer rate of 20 MB/s. These adapters can accept up to 12 devices daisy-chained along a single data path. The data path from the device to the adapter has the capability of handling up to 20 MB/s for data buffered at the device. The details of the HP DASD devices may be found in Ruemmler and Wilkes (1993). Their paper gives the necessary details for calculating the device level delays and accounting for the effects of the on-board cache and the look-ahead algorithm of the device.
More importantly, it provides an algorithm for calculating seek time that is much better than anything used previously.
At the operating system level, the HP9000 uses HP's version of Unix called HP-UX. This is a version of Berkeley Unix. The most important consideration here is that I/O is handled by a set of cache buffers that have their own management system. This became a major performance concern in trying to validate our module against the benchmark data. Based upon our results, the space devoted to I/O cache is divided into blocks of predetermined size, in our case 8172 bytes, and evenly distributed among buffer queues. The number of queues appears to be 128 plus the number of application files. To find out if a block of data is in cache, the operating system hashes to the correct queue and then sequentially searches the queue for the desired block. The implication is that the larger the cache is relative to the number of application files, the more time will be spent sequentially searching the queues. In the case of the HP9000, several relatively complete throughput-response curves were obtained for the various configurations, and eventually a set of parameters was developed for our module that gave an error of 5% or less compared to the benchmark data. In the case of the SPARC 2000 tests, the system was greatly over-stressed, the goal being to see what the maximum throughput was. As a consequence, not all parameters could be as carefully defined as for the HP9000.
However, useful data were obtained. The structure of the module is shown in Figure 1; each box in the figure corresponds to a queue in the module.
The philosophy of the module is to randomly sample processing times (service times, in queueing terminology) as a function of I/O size, but compute expected queueing delays based on system load as characterized by recent CPU utilization, read/write rates and the current probability of a memory cache hit. In this way the module achieves a high degree of fidelity without incurring the prohibitive computational cost of explicitly simulating each I/O.
Each subsection below describes one component of the module.
Cache
The CPU cache model has the following inputs:
$\lambda_{IO}$ = overall rate of reads and writes in MB/ms; this parameter may be dynamic (iolam).
$\lambda_{ru}$ = rate at which data is reused in MB/ms; $1/\lambda_{ru}$ can also be viewed as the expected time between requests for the same data; this parameter is not dynamic (ru...).
The Markov-process model assumes that data requests arrive at the cache according to a Poisson process at a rate of $\lambda_{IO}$, and data are reused at a rate $\lambda_{ru}$, with the time between reuse being exponentially distributed. A data quantum is pushed out of cache when it is the oldest quantum in cache and something new is pushed in. These assumptions allow a steady-state value of $p_{hit}$ to be calculated analytically.
CPU
The CPU model has the following inputs:
$\rho_{CPU}$ = utilization per CPU for tasks with which I/O has to contend (i.e., other system calls); this parameter is dynamic (crho).
$k$ = number of CPUs (k).
$\mu_{CPU}$ = service rate of CPU in MIPS (cmu).
$\lambda_{CPU} = k \rho_{CPU} \mu_{CPU}$ = the effective arrival rate to the CPU in the same units as $\mu_{CPU}$ (clam).
For each task performed, the constants $\gamma_0$ and $\gamma_1$ that determine the total CPU processing time $t(\text{I/O size})$ consumed by that task as overhead plus a linear function of I/O size (gamma0(task,hit) and gamma1(task,hit)).
$1/\mu_{cache}$ = the expected time to read the cache buffer (cacherd), which is a function of the number of pages in a string (cachstrg) and the time to read one page (pagerd).
$\lambda_{cache} = \lambda_{IO}/a_{IO}$ = arrival rate to the cache-buffer queues in units of I/Os per time (cchlam).
The model returns a total CPU processing time and an expected total delay waiting for the CPU; the same quantities are returned for contention and use of the cache-buffer queues.
The collection of CPUs is modeled as an M/G/k processor-sharing queue. Processing times at the CPU are calculated as a deterministic function of overhead plus a linear function of I/O size, depending on I/O task and whether or not there is a cache hit.
Expected delays waiting for the CPU are proportional to the total processing time for a processor-sharing model, and depend only on the service rate and not the distribution (Cooper, 1981).
Contention for the cache-buffer queue is modeled as an M/G/1 queue. The actual processing time depends on whether there is a cache hit or miss; when there is a hit, the time to search out the appropriate page is sampled.
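The expected-delay calculations that replace explicit simulation of each I/O can be illustrated with two textbook results. The sketch below is not the module's code. It assumes the standard Pollaczek-Khinchine formula for the cache-buffer queue, and it uses the single-server processor-sharing result as a simplified stand-in for the M/G/k processor-sharing CPU model. The function and argument names are hypothetical.

```python
def mg1_expected_wait(arrival_rate, mean_service, var_service):
    """Expected time waiting in queue for an M/G/1 queue
    (Pollaczek-Khinchine formula)."""
    rho = arrival_rate * mean_service  # server utilization
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must lie in [0, 1) for a steady state")
    second_moment = var_service + mean_service ** 2  # E[S^2]
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))


def ps_expected_delay(service_time, rho):
    """Expected extra delay for a job needing `service_time` under
    single-server processor sharing (an assumed simplification of M/G/k).

    The expected sojourn time is service_time / (1 - rho), so the delay
    beyond pure processing is service_time * rho / (1 - rho). It is
    proportional to the processing time and does not depend on the
    service-time distribution.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must lie in [0, 1) for a steady state")
    return service_time * rho / (1.0 - rho)
```

Both functions take only a few arithmetic operations per call, which is the reason a hybrid approach adds little computational burden to the host simulation.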
SCSI Host Adapter
The adapter model has the following inputs:
$n$ = number of adapters (n).
$n_{hot}$ = number of "hot" adapters (nhot).
$f_{hot}$ = a measure of imbalance between "hot" and not "hot" adapters, given as the fraction of all I/O requests routed to the "hot" adapters (fhot).
The model returns a total adapter processing time and an expected total delay waiting for the adapter.
Each adapter is modeled as an M/D/1 queue. Processing times at the adapter are calculated as a deterministic function of overhead plus a linear function of I/O size, depending on I/O task. Since two delays are experienced at the adapter (outbound to the data path and inbound from the data path), the total delay is
$$2 \times \frac{\lambda_{adpt}/\mu_{adpt}}{2(\mu_{adpt} - \lambda_{adpt})} = \frac{\lambda_{adpt}/\mu_{adpt}}{\mu_{adpt} - \lambda_{adpt}}.$$
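The adapter delay can be sketched in a few lines. The example below applies the standard M/D/1 expected wait twice, once for each direction through the adapter. The way the total request rate is split between "hot" and other adapters is an assumption made here for illustration (an even split within each group), and all function names are hypothetical.

```python
def adapter_total_delay(lam_adpt, mu_adpt):
    """Expected total queueing delay at one adapter, counting both the
    outbound and the inbound pass, each modeled as an M/D/1 wait."""
    if mu_adpt <= 0.0 or not 0.0 <= lam_adpt < mu_adpt:
        raise ValueError("need 0 <= lam_adpt < mu_adpt for a steady state")
    one_way = (lam_adpt / mu_adpt) / (2.0 * (mu_adpt - lam_adpt))
    return 2.0 * one_way


def per_adapter_rates(lam_total, n, n_hot, f_hot):
    """Split the total request rate across adapters.

    Assumption (not from the paper): requests routed to the hot group
    are spread evenly over the hot adapters, and the remainder evenly
    over the other adapters. Returns (hot_rate, other_rate).
    """
    if not 0 < n_hot < n:
        raise ValueError("need 0 < n_hot < n")
    if not 0.0 <= f_hot <= 1.0:
        raise ValueError("f_hot must be a fraction in [0, 1]")
    hot_rate = f_hot * lam_total / n_hot
    other_rate = (1.0 - f_hot) * lam_total / (n - n_hot)
    return hot_rate, other_rate
```

Because the M/D/1 wait grows without bound as the arrival rate approaches the service rate, even a modest imbalance toward a few adapters can dominate the total delay at high load.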
Data Path
The data-path model has the following inputs: The backplane is modeled as pure processing time (overhead plus a linear function of I/O size) with no contention (queueing).
Backplane time accounts for all transmission time down to the adapter level and back up from the adapter level.
Device
The disk storage device model has the following inputs:
$p_{dhit}$ = probability of a device cache hit (pdhit).
$r$ = total rotation time, in ms (rotate).
$p_{short}$ = probability of a short head movement (pshort).
$t_{short}$ = time to make a short head movement and settle, in ms (short).
The seek time is modeled as a two-point distribution on $t_{short}$ and $t_{long}$. The expected value and variance of this random variable are used in the queueing model, but a sampled value is returned. When a cache hit occurs, the device processing time is $t(\text{I/O size}) = q_0 + (\text{I/O size})/q_1$.
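A minimal sketch of the two-point seek-time distribution follows: its mean and variance feed the queueing calculation, while a sampled value is returned as the actual seek time. The function names are hypothetical, and the cache-hit time uses the overhead-plus-size-over-rate form given above with illustrative argument names.

```python
import random


def seek_moments(p_short, t_short, t_long):
    """Mean and variance of a seek time that equals t_short with
    probability p_short and t_long otherwise."""
    mean = p_short * t_short + (1.0 - p_short) * t_long
    variance = p_short * (1.0 - p_short) * (t_long - t_short) ** 2
    return mean, variance


def sample_seek(p_short, t_short, t_long, rng=random):
    """Draw one seek time from the two-point distribution."""
    return t_short if rng.random() < p_short else t_long


def device_hit_time(io_size, overhead, rate):
    """Device processing time on a device cache hit: fixed overhead
    plus I/O size divided by a transfer rate."""
    return overhead + io_size / rate
```

The variance term is largest when short and long seeks are equally likely and far apart, which is why the queueing delay at heavy load responds strongly to the short-seek probability.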
Task Models
The Unix I/O module can simulate the seven I/O tasks described below.
1.
2.
3.
4.
Open an unopened file without accounting for prefetch:
(a) CPU processing and delay; always a memory cache miss
The inputs are the configuration platform cp, the task type (as listed above), the size of the I/O in MB, the current input/output rate iolam in MB/ms, and the current CPU utilization 0 < crho < 1. The dynamic parameters crho and iolam should not be based on instantaneous snapshots, however, since the delay values returned by subroutine io are based on steady-state queueing models.
Rather, average values for a recent time period should be passed. In this way, the I/O module can reflect current load on the system. Some experimentation is required to determine an appropriate time window for averaging.
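One simple way to supply averaged rather than instantaneous values of crho and iolam is a sliding time window. The sketch below is an assumption about how a host simulation might do this. The class name, the window length, and the unweighted mean are all illustrative choices and are not part of the module.

```python
from collections import deque


class WindowAverage:
    """Unweighted mean of the samples observed in the most recent
    `window_ms` milliseconds of simulated time."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.samples = deque()  # (time_ms, value) pairs, oldest first
        self.total = 0.0

    def add(self, time_ms, value):
        """Record a sample and drop samples older than the window."""
        self.samples.append((time_ms, value))
        self.total += value
        cutoff = time_ms - self.window_ms
        while self.samples and self.samples[0][0] < cutoff:
            _, old_value = self.samples.popleft()
            self.total -= old_value

    def mean(self):
        """Current windowed mean, or 0.0 if nothing has been recorded."""
        if not self.samples:
            return 0.0
        return self.total / len(self.samples)
```

A host model would keep one such object for CPU utilization and one for the I/O rate, and pass their current means to the module on each call. A longer window smooths the load signal but reacts more slowly to changes, so the window length is a tuning choice.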
PARAMETRIZATION
In its final form, the Unix I/O module requires two different types of parameters:
(1) those that change with each simulation scenario or group of scenarios and deal with the configuration of the system and the data layout on the devices, and (2) is the probability of a device cache hit.
In our applications, the size of the I/O and the totally random access patterns created a probability of a device cache hit of near zero, and we often used zero for this value. In cases where there is a high proportion of records shorter than a block and a high degree of sequential access, the probability of a device cache hit becomes quite high. Ruemmler and Wilkes (1992) discuss this issue in detail.
System internals are values that change only with a change of platform.
They are implemented as permanent reference files that are called as needed when the overall computing environment changes its configuration. Specifically, they are matrices of values, based on various divisions of the model: CPU, backplane, cache manager, SCSI adapter, data path from device to adapter, and the device. For each component of the Unix I/O module there is a set of seven parameters, one for each of the I/O tasks. Each set of values includes a base service time and a data-size dependent rate for cache miss situations and the same pair for cache hit situations. For those elements whose parameters do not depend on the task, a single set of parameters was created.
The internal parameters for consumption of CPU, transfer on the backplane, adapter processing and transfer on the data path from device to the SCSI adapter each require a parameter for the fixed delay (in some cases zero) and another for the transfer rate for calculating data-size dependent delays. In the case of CPU processing, an additional set of parameters is created to account for the differences between memory cache hit and cache miss. There is a pair of these values for each of the seven I/O operations modeled. The CPU values are for setting up the I/O and not for doing the cache search, which was calculated separately and then added to the calculated CPU time. These parameters were obtained primarily from the manufacturer's literature on the system. In the case of the CPU, estimates were made of the fixed delays for operations other than READ based on the amount of work required to accomplish the operation. The amount of CPU for the READ operation was determined from the benchmark data. The module is not critically sensitive to any of these values.
Caching of I/O data is an important constraint on the performance of the systems in reality, and therefore requires considerable care to be represented properly in the module. To maintain integrity of the data, only one processor may access the cache data at a time. This means that when the CPU consumption of cache access and management is equivalent to one fully utilized processor, no higher throughput is possible.
In the case of the HP9000, the cache processing was particularly important for obtaining a good fit of the module to the benchmark data. Early plots of throughput response curves from the benchmark data indicated different curves at the device level for different configurations.
We found that the curves were essentially the same for the same number of files.
Coupling this information with the cache buffer management scheme, we were able to understand that this effect was due to the decrease in the sequential search time on a buffer queue.
After considerable study of the benchmark data it was found that a good approximation for the amount of CPU time due to cache buffer processing could be obtained via a linear function of the number of application files. The y-intercept of this equation is the time to process a buffer when only one application file is present, and the slope is negative, indicating a decrease in processing per buffer with an increase in the number of files. The slope is multiplied by the number of files and subtracted from the intercept to determine the delay per buffer. This value is then multiplied by the number of buffers in a cache buffer queue to determine the overall delay to search a single queue. This value is corrected for the probability of a cache hit and the fact that on average half of the queue is required to be searched to find the correct buffer.
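The cache-search calculation described above can be sketched as follows. The queue count of 128 plus the number of application files and the even distribution of buffers among queues come from the earlier description of HP-UX. The exact form of the hit correction is not stated in the text, so the sketch assumes that a hit searches half of a queue on average and a miss searches all of it. The function and parameter names are hypothetical, and the slope is passed as a positive magnitude.

```python
def cache_search_time(n_files, cache_bytes, block_bytes,
                      intercept_ms, slope_ms, p_hit, base_queues=128):
    """Expected CPU time (ms) to search one cache-buffer queue."""
    if not 0.0 <= p_hit <= 1.0:
        raise ValueError("p_hit must be a probability")
    # Buffers are spread evenly over the hash queues.
    n_queues = base_queues + n_files
    buffers_per_queue = cache_bytes / block_bytes / n_queues
    # Linear fit: per-buffer time decreases as the number of files grows.
    per_buffer_ms = intercept_ms - slope_ms * n_files
    if per_buffer_ms < 0.0:
        raise ValueError("linear fit extrapolated below zero")
    full_search_ms = per_buffer_ms * buffers_per_queue
    # Assumed correction: half the queue on a hit, the whole queue on a miss.
    return full_search_ms * (0.5 * p_hit + (1.0 - p_hit))
```

With the cache size held fixed, adding application files both shortens each queue and lowers the fitted per-buffer time, which matches the observation that the throughput-response curves depended on the number of files.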
In the case of the SPARC 2000, data were found in the benchmarks that indicated that the limit had been reached on cache processing. This information was used to determine the cost per I/O for cache handling. This is fixed for each I/O, regardless of configuration, because cache in Sun systems is handled as part of general memory. Therefore, the parameters for calculating cache buffer management were set to zero, and only a fixed time was used to determine the service time and queueing for this part of the module.
The individual disk devices presented the most important and sensitive portions of the module. Disk access is usually described by three parameters: seek time, rotational delay, and transfer time. However, as Ruemmler and Wilkes (1993) point out, the seek time is not linear, and is itself composed of several parts: acceleration, linear travel, deceleration, and settling. They provide an algorithm based on the distance the arm has to travel for calculating the seek time. This algorithm depends on the distance traveled, using one equation for less than one third of the full width of the disk, and a second for traveling the full width of the disk. We chose to approximate this as a two-point probability distribution using the values for a short seek, a long seek, and the probability of a short seek.
The short seek time was set equal to that required to move one track. The long seek time was set equal to the time to move the distance from the middle of the file nearest to the spindle to the middle of the file farthest from the spindle in the benchmark test. The probability of a short seek (which implies the probability for a long seek) was calculated using the average response time for a very lightly loaded disk (essentially no queueing), the average rotation and transfer times, and the short and long seek times. The module is extremely sensitive to this value, because it directly affects the variance of the service time and therefore the queueing at heavy loads. Despite the sensitivity, once this parameter is established and when the I/O is distributed relatively evenly across the disk surface, the module will work for simulations other than the one used to create it. The values we calculated work quite well if the short and long seeks are correctly estimated. Thus, if the device changes or the data layout changes greatly, this parameter should be recalculated. Rotational delay and transfer rates are obtained from the manufacturer's literature. Other than the differences in rotational delay and transfer rates, both HP9000 and SPARC 2000 devices used the same calculations for disk device performance.
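The calculation of the short-seek probability can be sketched by assuming that, on a very lightly loaded disk, the average response time is simply the sum of the mean seek time, the average rotational delay, and the average transfer time. Solving that relation for the probability gives the function below. The decomposition and all names are assumptions made for illustration.

```python
def short_seek_probability(avg_response, avg_rotation, avg_transfer,
                           t_short, t_long):
    """Solve avg_response = p*t_short + (1-p)*t_long
                            + avg_rotation + avg_transfer for p.

    All times must be in the same unit (for example, ms).
    """
    if t_long <= t_short:
        raise ValueError("t_long must exceed t_short")
    mean_seek = avg_response - avg_rotation - avg_transfer
    p_short = (t_long - mean_seek) / (t_long - t_short)
    if not 0.0 <= p_short <= 1.0:
        raise ValueError("inputs imply a probability outside [0, 1]")
    return p_short
```

A result outside the unit interval signals that the short or long seek estimate is inconsistent with the measured response time, which is a useful check before the value is used in the queueing model.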
VALIDATION
Validation of the parameters was coupled with validation of the module in an iterative process. The data for validation came from the benchmarks, and the comparisons were made to throughput-response curves plotted from that data. The goal was to be able to reproduce all the curves from a benchmark run with one set of parameters.
The process was essentially trial-and-error. Multiple runs would be made for a set of parameters with varied throughput and an empirical best fit obtained. When the behavior of the module did not correspond to the benchmark system, further research into the details of Unix and the hardware was done, and the module modified to account for the new findings. The process of fitting parameters was then repeated. The module required two modifications from the first edition, one to add queueing behavior to the connection between the device and the SCSI adapter, and the second and more important one to add the cache manager behavior.
Our final validation gave results that could be superimposed on the HP9000 benchmark data with no more than a 5% error at any data point. Because of the difference in the quality of the SPARC 2000 benchmark data when used for our purposes, a fit such as we obtained for the HP data was not possible. However, as mentioned above, the cache processing delay could be obtained as well as the CPU cost per READ I/O. The manufacturer's literature and experience were used to complete the set of parameters for this platform.
AN APPLICATION
After the basic modeling concept was developed and validated against the HP and Sun benchmark data, the module was implemented in an existing simulation model.
At LEXIS-NEXIS the primary simulation modeling tool used is SLAMSYSTEM.
The basic simulation package has been supplemented with about two dozen FORTRAN subroutines known as the Data Driven Modeling System (DDMS). DDMS was created at LEXIS-NEXIS to allow very large models to be represented by data files and to allow easy modification of these models. A detailed description of the process used at LEXIS-NEXIS is provided by Robinson (1994 Sybase database and a Sun acting as a file server. Four different scenarios representing different levels of business volume were run for both the old I/O method and the new I/O module. While each run of the simulation using the new I/O module took longer than the corresponding run using the old I/O method, the time delays and resources consumed as generated by the new Unix I/O module were much more realistic. For example, the time required to process a batch of documents was now sensitive to the overall loading (utilization of the CPUs) of the system. Formerly, I/O processing times were assumed to be independent of current CPU utilization rates.
DISCUSSION
The potential for the Unix I/O module is substantial, since the simulation of every I/O to the detail accounted for in the module requires prohibitive amounts of simulation execution time.
Because the module is parameter-driven, considerable attention must be paid to establishing the parameters. Many of the parameters may be obtained from manufacturers' literature.
However, for critical parameters, thorough testing and understanding of the workings of the operating system and the hardware is required. The most critical areas are cache management and the disk device.
In our experience, a combination of manufacturers' literature, a detailed description of the operating system (Leffler, et al. 1990), and careful analysis of benchmark data to understand cache management was required. Ruemmler's work (1992, 1993) provided the needed information on modeling the physical devices.
To fully and adequately parametrize the module, good benchmark data is essential. We were fortunate to obtain the HP9000 benchmarks, since they were designed for a different purpose. In the case of the Sun data, we obtained usable data but not as complete as the data for HP. We strongly recommend that specially designed benchmarks be commissioned for proper parameter determination.
The characteristics of a good benchmark include systematically varied loads and configurations, with careful measurements of CPU utilization, response times, I/O rates, service versus user time, interrupt counts and memory utilization.
When the module was tested by including it in the DDMS model, the run times for a single simulation of each scenario increased by 33-70%. However, the I/O module replaced two lines of SLAM code with a large subroutine.
Furthermore, the results obtained with the Unix I/O module could not have been obtained in any reasonable computation time with a fully simulated I/O model. There are also a number of improvements that could be made to increase the efficiency of the module in the DDMS environment, most of them focusing on the interface between DDMS and the module. We estimate that about 50% of the current increase in run time could be eliminated, resulting in a 15-35% increase in run time over deterministic methods.
SUMMARY
A hybrid queueing-simulation module was developed for modeling Unix I/O. This module is parameter-driven and allows great flexibility in modeling both varying system hardware configurations and varying workloads.
We found that cache management and the physical disk devices were the most important factors influencing the accuracy of the module. We also found that manufacturers' literature could provide many, but not all, of the necessary parameters.
The remaining parameters had to be obtained by iterative validation and modification of the module against benchmarks. As might be expected, benchmarks are a vital link, and should be custom designed and run for this purpose.
The hybrid module runs longer than a deterministic model of I/O, but provides results that could only be obtained from prohibitively long simulation runs.
