NASA-CR-192898
Interfacing a High Performance Disk Array File Server to a Gigabit LAN
Srinivasan Seshan and Randy H. Katz
University of California, Berkeley
April 21, 1993
Our previous prototype, RAID-I, identified several bottlenecks in typical file server architectures [Chervenak91]. The most important bottleneck was the lack of a high-bandwidth path between disk, memory and the network. Workstation servers, such as the Sun-4/280, have very slow access to peripherals on busses far from the CPU. For the RAID-II system, we addressed this problem by designing a crossbar interconnect, the XBUS board, that provides a 40 Mbytes/second path between disk, memory and the network interfaces. However, this interconnect does not provide the system CPU with low latency access to control the various interfaces. To provide a high data rate to clients on the network, we were forced to design the network software carefully and efficiently. A block diagram of the system hardware architecture is shown in Figure 1. In the following subsections, we describe the pieces of the RAID-II file server hardware that had a significant impact on the design of the network interface. Other papers [Lee92, Katz93] describe the architecture and implementation of the RAID-II server in greater detail.
FIGURE 1. RAID-II Organization. A high-bandwidth crossbar interconnect ties the network interface (HIPPI), the disk controllers, a multiported memory system, and a parity computation engine. An internal control bus provides access to the crossbar ports, while external point-to-point VME links provide control paths to the surrounding SCSI and HIPPI interface boards. Up to two VME disk controllers can be attached to each of the four VME interfaces. The design originally had 8 memory ports and 128 MB of memory; however, we built a four memory port version to reduce manufacturing time.
2.1 VME Link Boards
The remote VME links that connect the host CPU to the other sections of the system are extremely slow: about 2 Mbytes/second for most applications. Single word transfers across the link take 2 us each. In a few select applications, the link board can DMA data at up to 20 Mbytes/second. To meet our performance goal for data transfer to the network, very little data can be allowed to cross the link between the host CPU and the other parts of the system.
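The 2 Mbytes/second figure follows directly from the single-word transfer time: assuming 32-bit (4-byte) VME words, one transfer every 2 us works out to roughly 4 bytes / 2 us = 2 Mbytes/second.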
2.2 TMC I/O Backplane
The TMC I/O backplane consists of two unidirectional busses, the HIPPIS bus and the HIPPID bus, that move data between the TMC HIPPI boards and the XBUS board. Both busses are addressless, and only the TMC boards may act as bus master. We used a simple bus protocol that allows a TMC board to select a target or source board for the transfer and to perform flow control. Since the bus is addressless, any source or target device (for example, the XBUS board) must be set up before the transfer is initiated.
2.3 Host
The host CPU in the RAID-II server is a Sun-4/280 single board computer. It has 32 MB of VME memory and runs the Sprite operating system. The CPU performance of this machine is approximately 8 SPECmarks. The host is responsible for running most of the code that controls the RAID-II server: the file system code and the code controlling the drive interfaces, the XBUS board and the HIPPI boards. The host is slow by today's standards and is expected to be heavily loaded with file system and control tasks. It is important that the network interface not place a significant additional load on the CPU.
2.4 XBUS Board
The XBUS board implements a 4-by-8, 32-bit wide crossbar bus. This board provides a high bandwidth path between the disk controllers, memory and the network interface. Two of the crossbar ports provide connections to the TMC I/O backplane. Since the TMC I/O backplane busses are addressless, the XBUS board must be set up in advance for any transfers across the backplane. These ports can sustain 40 Mbytes/second of transfer to or from the TMC HIPPI boards. Another port is used by a control VME connection; the XBUS board is controlled by a set of registers present on this VME interface, which can be written by the TMC HIPPI boards or the Sun-4 CPU. The remaining ports provide connections to a 32 MB memory, a hardware XOR compute engine and four disk controller boards.
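Because the TMC I/O backplane is addressless, every transfer into or out of XBUS memory must be staged by writing the XBUS control registers over VME before the HIPPI boards begin moving data. The following fragment is a minimal sketch of that style of memory-mapped setup; the register names, offsets, and field encodings are assumptions for illustration and do not reproduce the actual XBUS programming model.

    #include <stdint.h>

    /* Hypothetical XBUS control-register layout (offsets and fields are
     * assumptions for this sketch; the real register map differs). */
    #define XBUS_VME_BASE        0x60000000u
    #define XBUS_REG_ADDR        0x00u   /* XBUS memory address of the transfer */
    #define XBUS_REG_LEN         0x04u   /* transfer length in bytes            */
    #define XBUS_REG_PORT        0x08u   /* which crossbar port (HIPPIS/HIPPID) */
    #define XBUS_REG_CTRL        0x0Cu   /* direction + go bit                  */
    #define XBUS_CTRL_TO_HIPPI   0x1u
    #define XBUS_CTRL_FROM_HIPPI 0x2u
    #define XBUS_CTRL_GO         0x80000000u

    static void xbus_reg_write(uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(XBUS_VME_BASE + off) = val;
    }

    /* Stage a transfer between XBUS memory and one of the TMC backplane
     * ports.  This must happen before the HIPPI board starts the transfer,
     * since the addressless backplane carries no destination information. */
    void xbus_stage_transfer(uint32_t xbus_addr, uint32_t length,
                             uint32_t port, int to_hippi)
    {
        xbus_reg_write(XBUS_REG_ADDR, xbus_addr);
        xbus_reg_write(XBUS_REG_LEN,  length);
        xbus_reg_write(XBUS_REG_PORT, port);
        xbus_reg_write(XBUS_REG_CTRL,
                       (to_hippi ? XBUS_CTRL_TO_HIPPI : XBUS_CTRL_FROM_HIPPI)
                       | XBUS_CTRL_GO);
    }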
2.5 SCSI Controllers
The XBUS board is connected to a set of four VME busses. Each of these VME busses currently contains an Interphase Cougar SCSI controller. Each board is capable of handling approximately 7 Mbytes/second of data traffic from two independent SCSI strings. Physical packaging limits each string to 3 disks.
These boards limit the RAID-II system to a maximum of 28 Mbytes/second of disk bandwidth and 24 disks. Software and other bottlenecks may limit the performance further. Future SCSI boards will allow the system to use 72 disk drives and to provide up to 32 Mbytes/second per XBUS board.
2.6 TMC HIPPI Boards
The HIPPI interface for RAID-II is implemented using a two board set built by Thinking Machines Corporation (TMC). The architecture of the boards is shown in Figure 2. Each board contains an interface to a single direction of the HIPPI channel, a unidirectional backplane bus and a control VME bus. Each board also contains an AMD 29000 (29K) processor and some local memory. Programs and data for the 29K processor can be downloaded from the VME control bus. The 29K processor can be used to set up transfers and to run any general purpose code (protocol code). Some significant differences exist between the two boards. The individual boards are described in more detail in the next few subsections.
FIGURE 2. TMC HIPPI Board Block Diagram. (Source board: HIPPI channel interface, HIPPIS bus, 29K CPU, VME bus. Destination board: HIPPI channel interface, HIPPID bus, 29K CPU, VME bus.)
2.6.1 HIPPI Source Board
The HIPPI source board interfaces to the VME bus through a set of five registers. The functions of these registers are summarized in Table 1. The input and output FIFOs are the most important of these registers, since they provide the only general purpose communication interface between code running on the 29K processor and the host CPU. Since the source board has no VME bus mastering capability, data must be copied into the input FIFO and out of the output FIFO by another VME master. This VME interface has two important consequences.
TABLE 1. HIPPI Source Board VME registers
VME Register Description
Configuration Sets up VME functionality of board - enable interrupts,
set VME address modifier, etc.
Input FIFO Receives data from VME into FIFO read by 29K.
Output FIFO Stores data written by 29K into a FIFO that can be read
from the VME.
Status Stores current status of board
Reset Controls reset of various sections of board
First, the XBUS board must be set up for transfers to the source board by some other part of the system (e.g., the host CPU). Second, since there is no mechanism to lock access to the FIFOs, the Sun-4 CPU and the destination board cannot both communicate with the source board. Therefore, to prevent mixing of data from two sources, we limit access to the Sun-4 CPU only.
The source board's interface to the HIPPI output channel and the input backplane bus is controlled by a set of on-board registers. These registers are only accessible by the 29K processor. A single FIFO is used to store data to be sent out on the HIPPI channel. A simple state machine fills this FIFO with data from the TMC I/O backplane. The 29K initiates this transfer by writing various registers; it must know the total length of the transfer in advance. By not involving the 29K in copying data from the backplane, the system can achieve the full HIPPI bandwidth (100 Mbytes/second).
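As an illustration of this control flow, the fragment below sketches how the source board's 29K might start such a transfer: it programs the total length into an on-board register and starts the state machine, never touching the data itself. The register names and addresses are invented for this sketch and do not reflect the actual source board register map.

    #include <stdint.h>

    /* Assumed on-board registers visible only to the 29K (placeholder
     * addresses; the real source board register map is not reproduced here). */
    #define SRC_REG_XFER_LEN   (*(volatile uint32_t *)0x00800000)
    #define SRC_REG_START      (*(volatile uint32_t *)0x00800004)

    /* Send 'length' bytes supplied over the HIPPIS backplane bus by an XBUS
     * port that has already been staged.  The total length must be known in
     * advance; the state machine, not the 29K, moves the data into the HIPPI
     * output FIFO. */
    void src_start_hippi_transfer(uint32_t length)
    {
        SRC_REG_XFER_LEN = length;   /* total bytes to stream out on HIPPI       */
        SRC_REG_START    = 1;        /* kick the backplane-to-FIFO state machine */
    }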
2.6.2 HIPPI Destination Board
The destination board provides several features beyond those of the source board. Its VME interface has several new registers that provide general purpose communication with the host. The destination board
VME registers are summarized in Table 2. In addition to the new registers, the destination board supports
VME bus mastering. The board can write data from the output FIFO to any VME location. Similarly, it can
read VME locations to the input FIFO. As a result, the HIPPI destination board can set up the XBus board
for transfers.
The destination board's interface to the HIPPI input channel is composed of a set of registers accessible
by the 29K. Data from the HIPPI channel is automatically placed in a FIFO, which can be copied by a state
machine to the backplane. The 29K must set up the transfer to the backplane by writing the length of the
transfer to an on-board register. The 29K must then poll the status of the state machine to identify the end of
the transfer.
TABLE 2. HIPPI Destination Board VME Registers
Register Description
Configuration Sets up VME functionality of board - enable interrupts,
set VME address modifier, etc.
Input FIFO Receives data from VME into FIFO, can be read by 29K.
Output FIFO Stores data written by 29K into a FIFO that can be read
from the VME.
Status Stores current status of board
Command Stores data written from VME, can be read by 29K.
Response Stores data written by 29K, can be read from VME.
Reset Controls reset of various sections of board
2.7 Ultranet
The UltraNetwork is a hub-based store and forward network capable of transmission rates up to 1 Gbit/second. Figure 3 shows our Ultranet topology. The hubs create a high speed switching interconnection by routing incoming packets to the proper destination.
FIGURE 3. UC Berkeley UltraNetwork Topology. (Three Sun workstations attach through VME host adapters; RAID-II attaches through a hub-based adapter; all connect to the hub's link adapters.)
Hubs are physically connected by serial links capable of transmission rates of 250 Mbits/second. Up to 4 links can be used between a pair of hubs. Data is striped across these links to achieve Gbit/second speed. These links terminate in link adapters in the hubs. Link adapters are also used to connect to machines with Ultranet host adapters. Host adapters are available for machines with industry standard backplanes (e.g. VME). Each host adapter contains an on-board microprocessor and can perform DMA to the host's memory. The on-board microprocessor does all the protocol processing necessary to communicate across the UltraNetwork to remote clients. Computers without standard backplanes, typically mainframes and supercomputers, can connect to the UltraNetwork using standard channel interfaces (for example HIPPI or HSX) to a hub-based adapter. This essentially moves the network interface into the hub itself. Much of the UltraNetwork protocol is handled by the processor on the hub-based adapter. However, software must run on channel-
connected hosts to handle communication to the hub-based adapter. This software is described in more detail
below.
VME Ultranet host adapters in a Sun system provide a maximum of about 4 Mbytes/second to the network. On the basis of the RAID-II performance goal of 40 Mbytes/second, we decided that a HIPPI attachment to the drive array was necessary.
Each transfer between the UltraNetwork hub and the hub-adapter attached host is composed of a DMA word followed by either a request block or data. The maximum size of the data segment of each transfer is limited to 32 KB by the Ultranet adapter. The DMA word accompanying each transfer describes the contents of the transfer. Analyzing the DMA word provides sufficient information to identify the correct memory destination for the transfer. Request blocks are commands that pass between the hub-based adapter and the host. Each request block has a rough analogue in the BSD 4.2 network socket calls, which made it easy to provide the file system with a socket interface to the network. Several of the most important request blocks are summarized in Table 3. Only a few standard data formats are used to transmit the various request blocks; as a result, each request block requires sending significantly more data than is necessary.
TABLE 3. UltraNetwork Request Blocks
Request Block BSD Equivalent
OPEN socket()
ADAPTER LISTEN combination of bind(), listen() and accept()
CONNECT connect()
CLOSE close()
SEND send()
RECEIVE recv()
3.0 Software Architecture/Implementation
Both TMC and Ultranet provided software to support the original uses of their systems. After examining the provided code, we decided that completely new software was needed for several reasons. First, the RAID-II file server runs the Sprite operating system; both the TMC and Ultranet software were developed for SunOS and would have needed a significant amount of work to port to Sprite. Second, the software was developed to support more standard machine interconnections and, as a result, could not provide the high performance we needed on the RAID system. In this section, we describe the organization of the networking software we developed for the RAID-II file server. We examine the decisions made during the software implementation and the reasoning behind these decisions.
3.1 Architecture
The interface provided to the file system code and the division of code between the 29K and Sun-4 CPUs were the two basic issues of the software architecture. Based on the Ultranet request block format, we decided to provide the file system code with a socket interface to the network, making both the networking and file system code easier to implement. We also decided to implement most of the software on the 29Ks for a variety of reasons. First, it had been estimated that the Sun-4 CPU would be heavily loaded by running the file system software and controlling the hardware of the RAID-II system. Second, the connection of the Sun-4 CPU to the rest of the system is through slow VME link boards; involving the Sun-4 CPU in data transfers would reduce the bandwidth of the RAID-II server significantly. To support a high bandwidth between the network and memory, the 29K CPUs must control as much of the data transfer as possible. The 29K CPUs were programmed to understand the Ultranet request block interface and handle incoming data transfers. However, since access to the source board cannot be shared, the Sun-4 CPU must set up the outgoing data transfers. The software architecture is shown in Figure 4. An example transfer is described in the next section to clarify the software architecture.
FIGURE 4. Software Architecture Division. (The HIPPI source and destination 29Ks terminate the HIPPI channels and the Ultranet request block protocol; the Sun-4 host CPU, connected over the VME link, presents the socket interface to the file system.)
3.1.1 Sample Transaction
This section describes a sample network transaction that may occur between the RAID server and a client on the Ultranet. In this example, the client creates a connection to the server, sends some data and receives a reply. This communication is shown graphically in Figure 5; a code-level sketch of the same sequence follows the numbered steps below.
1. The file server starts by issuing an open() of a socket. This results in the HIPPI source board sending out an OPEN request block. The HIPPI destination board receives the completed OPEN request block from the Ultranet hub. The destination 29K interprets the request block and returns a new socket id to the file server, and the open() call completes.
2. The file server issues a listen() on the socket id. This is accomplished by the source board sending an ADAPTER LISTEN request block to the Ultranet hub. listen() is a combination of BSD bind(), listen() and accept(). When a client creates a connection to the file server, the completed ADAPTER LISTEN request block returns to the destination board. The destination board sends information about the established connection to the file server and the listen() call completes.
3. The file server does a recv() on the connected socket id. A pointer to an empty host buffer and a tag that uniquely identifies the transfer are passed to the destination board. The source board sends a RECEIVE request block to the Ultranet hub. The Ultranet matches up the request block with a client's send of data and transfers the data to the HIPPI destination board. The destination board uses the unique tag to identify the transfer. The destination board then sets the XBus board up for the transfer and begins the transfer to the backplane. A completed RECEIVE request block is sent by the Ultranet hub after the transfer completes. The destination board sends the request block status to the file server and the recv() completes.
4. The host issues a send() of the reply data on the socket id. A pointer to the host data buffer and a tag that uniquely identifies the transfer are passed to the destination board. The source board sends a SEND request block to the Ultranet hub. The Ultranet matches up the request block with a client's receive of data. The Ultranet hub sends a request to the HIPPI destination board to begin transfer of the data. The destination board uses the unique tag to identify the transfer request and determine the data to be sent. The destination board requests the Sun-4 to set up the XBus board and source board to transfer the desired data to the Ultranet hub. A completed SEND request block is sent to the destination board by the Ultranet hub after the transfer completes. The destination board sends the request block status to the file server and the send() completes.
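The sketch below shows how file server code might drive this request/reply exchange through the socket-style interface described above. The function names, signatures, tag values and error handling are assumptions made for illustration; only the sequence of operations comes from the steps above.

    #include <stdint.h>
    #include <stddef.h>

    /* Assumed socket-style interface exported to the Sprite file system
     * (illustrative declarations; the real interface may differ). */
    int ultra_open(void);                                 /* sends OPEN            */
    int ultra_listen(int sock);                           /* ADAPTER LISTEN        */
    int ultra_recv(int sock, void *buf, size_t len, uint8_t tag);        /* RECEIVE */
    int ultra_send(int sock, const void *buf, size_t len, uint8_t tag);  /* SEND    */
    int ultra_close(int sock);                            /* CLOSE                 */

    /* One request/reply exchange with a single client (steps 1-4 above). */
    int serve_one_request(void *req_buf, size_t req_len,
                          const void *reply_buf, size_t reply_len)
    {
        int sock = ultra_open();           /* step 1: OPEN completes, new socket id */
        if (sock < 0)
            return -1;

        int conn = ultra_listen(sock);     /* step 2: bind + listen + accept        */
        if (conn < 0)
            return -1;

        /* Step 3: post a receive; the destination board matches the tag and
         * moves the client's data into the posted buffer. */
        if (ultra_recv(conn, req_buf, req_len, 1 /* tag */) < 0)
            return -1;

        /* Step 4: send the reply; the host stages the XBUS and source board
         * when the Ultranet hub requests each fragment. */
        if (ultra_send(conn, reply_buf, reply_len, 2 /* tag */) < 0)
            return -1;

        return ultra_close(conn);
    }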
3.2 Source Board Code
From the example transfer, it should be evident that the HIPPI source board must send both data and request blocks to the Ultranet hub. The commands to perform these actions are summarized in Table 4. These commands are executed by the Sun-4 CPU writing to the source board's VME input FIFO. In general, the Ultranet request block formats contain many data fields that can be eliminated; commands between the 29K and Sun-4 CPU contain only the essential fields of the associated request blocks. The effectiveness of this "compression" is discussed in Section 4.2, "Reduction of VME Link Traffic".
FIGURE 5. Communication between the Sun-4 CPU, TMC HIPPI boards, Ultranet hub and a client during the sample transaction.
TABLE 4. Commands between Source 29K and host CPU
Command Description
UltraOpen() Sends request to UltraNetwork to open a socket.
UltraListen() Sends request to bind a Socket ID to a port, then listens for a connection and accepts it.
UltraClose() Sends request to close the connection active on a socket ID.
UltraSend() Sends request to send data on the connection associated with a socket ID. Each send request block is given a unique 8 bit tag that identifies it.
UltraRecv() Sends request to receive data on the connection associated with a socket ID. Each receive request block is given a unique 8 bit tag that identifies it.
SendData() Sends a requested number of bytes from the Sun-4 and the XBUS on the HIPPI channel.
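A hypothetical layout for one of these compressed commands is sketched below; only the idea of forwarding essential fields (with an 8-bit tag for SEND/RECEIVE) comes from the text, and the field names, widths and sizes are assumptions (Table 7 gives the actual compressed sizes).

    #include <stdint.h>

    /* Hypothetical compressed command passed from the Sun-4 to the source
     * 29K through the VME input FIFO.  Field layout is illustrative. */
    enum ultra_cmd {
        ULTRA_OPEN, ULTRA_LISTEN, ULTRA_CLOSE,
        ULTRA_SEND, ULTRA_RECV, SEND_DATA
    };

    struct ultra_cmd_blk {
        uint8_t  cmd;        /* one of enum ultra_cmd                       */
        uint8_t  tag;        /* unique 8-bit tag for SEND/RECEIVE requests  */
        uint16_t socket_id;  /* connection the command applies to           */
        uint32_t length;     /* byte count for SEND/RECEIVE/SendData        */
        uint32_t address;    /* XBUS or Sun-4 buffer address, when relevant */
    };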
Interfacing a High Performance Disk Array File Server to • Gigabit LAN 10
April 21, 1993
3.3 Destination Board Code
To support the example transfer, the destination board needs to interpret the incoming Ultranet request blocks and scatter-gather Ultranet data requests. The Sun-4 uses the command described in Table 5 to notify the destination board of buffers allocated for both incoming and outgoing transfers.
TABLE 5. Commands Between Host CPU and Destination 29K
Command Description
ScatterGather() Allocates buffers in both Sun-4 and XBus memory for a transfer associated with a specific tag.
The destination board must also read and interpret all transfers from the Ultranet. Every HIPPI transfer sent from the Ultranet hub to the destination board starts with a DMA word carrying four fields: Content Description, Transfer Offset, Tag and Transfer Length. Content Description identifies whether the transfer contains a request block, a data request or data. If the current transfer is part of a larger multipart data transfer (larger than a 32 KByte transfer), Transfer Offset provides the byte offset of the data being sent within the entire transfer. Tag is the unique identifier for every send or receive of data. Transfer Length is the byte length of the current transfer. Due to buffering limitations in the Ultranet hub, Transfer Length is never more than 32 KBytes.
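A logical view of these fields as a C structure is given below. The real DMA word packs the fields into the word(s) that precede each transfer; the widths and packing shown here are assumptions for the sketch, and only the field names and meanings come from the text.

    #include <stdint.h>
    #include <stdbool.h>

    /* Logical contents of the DMA word preceding every HIPPI transfer from
     * the Ultranet hub (field widths and packing are assumed). */
    enum dma_content { DMA_REQUEST_BLOCK, DMA_DATA_REQUEST, DMA_DATA };

    struct dma_word {
        uint8_t  content;   /* request block, data request, or data              */
        uint32_t offset;    /* byte offset of this fragment within the whole     */
                            /* (possibly multipart) transfer                     */
        uint8_t  tag;       /* unique id of the matching send or receive         */
        uint32_t length;    /* bytes in this fragment, at most 32 KBytes         */
    };

    /* Sanity check implied by the Ultranet hub's buffering limit. */
    static bool dma_word_valid(const struct dma_word *w)
    {
        return w->length <= 32u * 1024u;
    }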
3.3.1 Completed Request Blocks
The destination board 29K must notify the Sun-4 of any completed request blocks it receives. When a
request block arrives at the destination board, the VME DMA engine is used to copy the essential fields
(same fields that are used by the source board to send a request block) of the request block into the Sun-4
CPU's main memory. Next, the destination board interrupts the Sun-4 CPU to notify it of the completion of
an Ultranet request. The host CPU may then examine the completed request block for either status or returned values.
3.3.2 Incoming Data
When incoming data arrives at the destination board, the 29K processor uses the tag, transfer offset and transfer length fields of the DMA word and previously processed ScatterGather() commands to determine the destination of the data. If the destination address of the data is in host memory, the 29K removes the data from the HIPPI channel and DMA copies the data to the proper VME location.
However, if the data should be placed in XBUS board memory, the destination board sets the XBUS board up for the transfer by writing to the XBUS VME registers. Next, the 29K enables the state machine to copy data from the HIPPI channel to the XBUS board. The data transfer is complete when the state machine finishes.
3.3.3 Outgoing Data
When an Ultranet request for data arrives at the destination board, the 29K processor uses the tag, transfer offset and transfer length fields of the DMA word and previously received ScatterGather() commands to determine the source of the data. The destination board cannot use its VME DMA engine to set up the transfer for several reasons. First, access to the source board VME FIFO cannot be shared by the host and the destination board. Second, the destination board's VME DMA engine reads data into the destination board's VME input FIFO; the host CPU must also access this input FIFO, and access to it cannot be shared. As a result, the host CPU must set up the transfer of data. The destination board copies the length and source address of the transfer to its VME output FIFO and interrupts the host CPU. The host CPU uses the length and source address to set up the XBUS and HIPPI source board for the transfer. This is done by writing to the XBUS control registers and issuing the SendData() command to the HIPPI source board.
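The host side of this outgoing-data path can be pictured as a small interrupt handler: read the fragment description from the destination board's output FIFO, stage the XBUS board, and queue a SendData() command to the source board. The helper names below are assumptions; the division of labor is as described above, and this per-fragment interrupt is what drives host CPU load on reads (Section 4.4).

    #include <stdint.h>

    /* Assumed helpers on the Sun-4 host (illustrative names). */
    uint32_t dst_fifo_read(void);                            /* read dest output FIFO */
    void xbus_stage_to_hippis(uint32_t xbus_addr, uint32_t len); /* via XBUS VME regs  */
    void src_cmd_send_data(uint32_t len);                    /* SendData() to src FIFO */

    /* Invoked once per fragment requested by the Ultranet hub. */
    void host_outgoing_fragment_intr(void)
    {
        uint32_t length = dst_fifo_read();   /* fragment length      */
        uint32_t source = dst_fifo_read();   /* XBUS source address  */

        xbus_stage_to_hippis(source, length);
        src_cmd_send_data(length);
    }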
3.4 Implementation
The approximate size of code running on the TMC HIPPI boards is summarized in Table 6.
TABLE 6. HIPPI Board Code Statistics
Section Lines of Code
Source Board C Code 3500
Destination Board C Code 3500
Shared TMC Boards C Code 1500
Shared TMC Boards Assembly 700
Estimated Man Hours 900
Almost 7000 additional lines of code were written for the Sun-4 host CPU to support the UltraNetwork and HIPPI boards. Much of the code and development time can be attributed to the lack of documentation for the Ultranet and the poor match between the architecture and the Ultranet protocol.
4.0 Performance Measurements
In this section, we examine the end-to-end network performance of the RAID-II file server. We analyze measurements of network bandwidth, CPU load of the RAID-II system and system hardware bandwidths to identify the bottlenecks that limit the network performance of the system.
4.1 RAID-II Hardware Performance
The RAID-II system was designed to support a 40 Mbytes/second data path between disk, memory and network. The performance of the system is carefully analyzed in [Chen93]. Measurements show that transfers between the XBus board memory and the TMC HIPPI boards have a latency of 1.1 ms and a maximum throughput of 38.5 Mbytes/second. The majority of this latency is attributed to the configuring of the XBus board and the handling of the HIPPI channel by software on the TMC board. These measurements were taken on a system with minimal software on the host CPU and on the 29K CPUs; they indicate the maximum achievable performance of the RAID-II hardware.
4.2 Reduction of VME Link Traffic
To improve the network performance of the RAID-II system, we include only the essential fields of Ultranet request blocks in the messages between the Sun-4 and the HIPPI boards. This was done to reduce the utilization of the slow VME link between the Sun-4 and HIPPI boards, which is capable of handling only 2 MBytes/second and must also carry file system metadata. The "compression" achieved is summarized in Table 7. On average, messages are reduced in size by 50%. Unfortunately, the utilization of the link is not easily measurable, so we cannot determine whether it is near saturation.
TABLE 7. Ultranet Request Block Sizes in Bytes
Request Block Normal Size Compressed Size
OPEN 44 20
LISTEN 92 56
CLOSE 92 20
SEND 44 24
RECEIVE 44 24
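As a rough check, summing one of each of the five request block types in Table 7 gives 316 bytes in the standard formats versus 144 bytes compressed, a reduction of roughly 54%.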
4.3 Network Performance
The UltraNetwork currently installed at UC Berkeley supports three Sun VME workstations. Each Sun workstation can produce or consume approximately 3.5 MBytes/second [Clinger89]. This provides a maximum aggregate bandwidth of 10.5 MBytes/second. RAID-II is capable of completely satisfying this network load (and more). Under the current maximum load, all clients receive data at their full desired bandwidth. Therefore, bandwidth limitations of the RAID-II network interface can currently only be estimated from scaling arguments. The performance numbers reported are based on a thousand packets of a fixed size sent over a single connection between the XBus memory in RAID-II and a client machine on the Ultranet.
The time to complete these transfers was used to obtain both average bandwidth and latency measurements for various packet sizes.
Figure 6 shows the bandwidth of data for different sized packets being sent between RAID-II and individual clients. The bandwidth of a Cray supercomputer communicating with a single Sun-3 client is shown for comparison. SunOS 3.5 operates approximately 10-15% faster than SunOS 4.1. The maximum bandwidth for the Sun-3 clients is 3.5 MBytes/second reading data from RAID-II and 3.7 MBytes/second writing data to RAID-II. The maximum bandwidth for the Sun-4 clients is 3.0 MBytes/second reading data from RAID-II and 3.5 MBytes/second writing data to RAID-II. The performance gap between reading and writing on a Sun-4 is due to cache conflicts in the Sun-4 memory system: when data arriving from the network is written into Sun-4 memory, the virtually addressed cache in the Sun-4 must be updated, lowering the bandwidth of writes into Sun-4 memory.
FIGURE 6. Bandwidth vs. Packet Size for transfers between RAID-II and a single client. (Left panel: Ultranet reads from the server, showing RAID to Sun-3, RAID to Sun-4 and Cray-2 to Sun-3. Right panel: Ultranet writes to the server, showing Sun-3 to RAID, Sun-4 to RAID and Sun-3 to Cray-2.)
Figure 7 shows the latency to send different sized packets between RAID-II and individual clients. The performance of a Cray supercomputer communicating with a single Sun-3 client is shown for comparison. The minimum latency of packets for a Sun-3 is 6.0 ms reading from RAID-II and 4.5 ms writing to RAID-II. The minimum latency of packets for a Sun-4 is 2.2 ms reading from RAID-II and 1.3 ms writing to RAID-II. Measurements of the RAID-II hardware [Chen93] indicate that approximately 1.1 ms of this latency is due to delays in the file server. These numbers indicate that it is the processing speed of the clients that limits the end-to-end latency of communication.
FIGURE 7. Latency vs. Packet Size for transfers between RAID-II and a single client. (Left panel: Ultranet reads from the server, showing RAID to Sun-3, RAID to Sun-4 and Cray-2 to Sun-3. Right panel: Ultranet writes to the server, showing Sun-3 to RAID, Sun-4 to RAID and Sun-3 to Cray-2.)
Applications on the clients are unable to consume the 3.5 MBytes/second of data delivered. For example, video stored on the RAID-II file server can be played back on the Sun-4 clients at a rate of 5 frames/second, which corresponds to a transfer rate of 1.5 Mbytes/second. The video data is copied across the client's VME backplane twice, first from the network interface to memory and then from memory to the frame buffer; this contention for the VME backplane cuts the available bandwidth in half.
4.4 CPU Utilization
The network software for RAID-II splits the workload of network communication across three processors: the Sun-4 host CPU and the two AMD 29K CPUs on the HIPPI boards. In this section, we examine the utilization of these processors during transfers.
The utilization of the Sun-4 CPU is highly dependent on the packet size of the transfers. Figure 8 shows the utilization of the Sun-4 CPU when all three clients are transferring data; together the clients consume or create approximately 10.5 MBytes/second of data traffic. When the clients are writing data to RAID-II, the host CPU must take a single interrupt per packet, so the load on the host CPU is inversely proportional to the packet size. When clients are reading data from RAID-II, the host CPU must be interrupted for every outgoing data fragment transfer requested by the Ultranet hub. All packets are fragmented into 32 Kbyte transfers across the HIPPI channel. As a result, the host CPU utilization has a minimum of 48% and remains almost constant for packets larger than 32 Kbytes.
FIGURE 8. Sun-4 host CPU utilization vs. Packet Size for 3 clients simultaneously communicating with RAID-II. (Curves shown: RAID to clients and clients to RAID.)
The host CPU utilization limits the network performance of clients reading from RAID-II to 21 MBytes/second, about twice the currently available performance. Since packets on the UltraNetwork can be several megabytes, the host utilization places no limit on the bandwidth of clients writing data to the RAID-II system.
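This limit follows from a simple scaling assumption: if host CPU load grows linearly with delivered bandwidth, then the measured 48% utilization at 10.5 MBytes/second of reads corresponds to saturation at roughly 10.5 / 0.48, or about 22 MBytes/second, consistent with the 21 MBytes/second estimate above.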
The utilization of the 29K CPUs on the HIPPI boards depends mostly on the bandwidth of data being
transferred. This is due to the fact that the 29K processors have a fixed computation overhead per 32KByte
fragment transferred on the HIPPI channel. Their utilization is, therefore, not dependent on packet size but
only on the actual bandwidth of data. Table 8 shows the utilization of the 29K CPUs for different bandwidths
of data. When writing data to RAID, the destination board is highly utilized since it must set up and perform
the data transfers. For the same transfers, the source board only processes outgoing request blocks. During
reads from the RAID system, the source board must perform the overhead of transferring the data. The destination board must still set up the data transfers.
TABLE 8. Utilization of 29K Processors during Network Transfers
Bandwidth          Read From RAID             Write To RAID
(Mbytes/second)    Source 29K   Dest. 29K     Source 29K   Dest. 29K
3.5                21%          7%            18%          14%
7.0                N/A          18%           16%          23%
10.5               35%          20%           18%          27%
These numbers indicate that the 29K CPUs would limit the network to approximately 32 Mbytes/second for both reads and writes to RAID-II.
5.0 Conclusions
The two basic goals of the RAID-II network software were to provide high bandwidth to clients on the UltraNetwork and to reduce the load on the host CPU. Measurements in [Chen93] indicate that the RAID-II system hardware can support a raw bandwidth of 38.5 MBytes/second between memory and the network. Based on our scaling estimates, the RAID-II server can source approximately 21 Mbytes/second to the Ultranet (limited by the host CPU) and sink 32 MBytes/second (limited by the destination board 29K CPU) from the network. Upgrading the host CPU to more modern hardware would allow the RAID-II system to source 32 MBytes/second to the network. This bandwidth is significantly higher than that of Ethernet-based file servers in our environment; for comparison, our Sprite OS file server supports a bandwidth of about 1 MByte/second to the network [Welch90]. These results show that the RAID-II network interface was effective at providing high bandwidth to clients on the Ultranet. Although the software design did reduce the load on the host CPU by effectively using the 29K CPUs, we could not prevent the host CPU from being a critical resource for sourcing data. We feel that the network performance of the RAID-II server with Ultranet clients cannot be improved significantly.
With some minor hardware changes, there are a number of mechanisms that could improve the performance of the system toward the maximum of 38.5 Mbytes/second. First, the limiting CPU utilizations could be reduced by sharing access to the HIPPI source board between the host CPU and the HIPPI destination board. The sharing would make it unnecessary to interrupt the Sun-4 host every 32 Kbytes. However, this sharing is impossible to achieve efficiently without an improved VME interface on the HIPPI boards. Another possibility would be using larger packets to communicate to and from the TMC HIPPI boards. The Ultranet hub architecture
currently limits us to 32 Kbyte transfers. The utilization of both the 29K CPUs and the Sun-4 CPU would be greatly reduced by the use of larger packets, allowing us to scale to much higher bandwidths. To increase the packet size, we plan on replacing the Ultranet with a HIPPI switch network. Using the HIPPI switch network, we hope to support transfers at over 70 Mbytes/second to a pair of XBus boards.
6.0 References
[AnonA] Network Operations Manual. UltraNetwork Technologies, Part Number 06-0001-001, Revision A, 1990. Chapter 2: UltraNet Architecture; Chapter 3: UltraNet Hardware.
[AnonB] HPPI Destination Module (HPPID) Hardware Specification. Thinking Machines Corp., October 1990.
[AnonC] HIPPI Source Interface Hardware Register Specification. Thinking Machines Corp., September 1990.
[ANSI91] High-Performance Parallel Interface - Framing Protocol (HIPPI-FP). American National Standard for Information Systems X3T9.3/89-013, Rev 4.2, June 1991.
[Chen93] Peter M. Chen, Edward K. Lee, Ann L. Drapeau, Ken Lutz, Ethan L. Miller, Srinivasan Seshan, Ken Shirriff, David A. Patterson, Randy H. Katz. Performance and Design Evaluation of the RAID-II Storage Server. To appear in International Parallel Processing Symposium 1993 Workshop on I/O.
[Chervenak91] Ann L. Chervenak and Randy H. Katz. Performance of a Disk Array Prototype. Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, volume 19, pages 188-197, May 1991.
[Clinger89] Marke Clinger. Very High Speed Network Prototype Development; Task 2.1: Measurement of Effective Transfer Rates. Ultra Network Technologies, October 1989.
[Katz91] Randy H. Katz. High Performance Network and Channel-Based Storage. Proceedings of the IEEE, Vol. 80, No. 8, pages 1238-1260, August 1992.
[Katz93] Randy H. Katz, Peter M. Chen, Ann L. Drapeau, Edward K. Lee, Ken Lutz, Ethan L. Miller, Srinivasan Seshan, and David A. Patterson. RAID-II: Design and Implementation of a Large Scale Disk Array Controller. 1993 Symposium on Integrated Systems, 1993. University of California at Berkeley, UCB/CSD 92/705.
[Lee92] Edward K. Lee, Peter M. Chen, John H. Hartman, Ann L. Chervenak Drapeau, Ethan L. Miller, Randy H. Katz, Garth A. Gibson, and David A. Patterson. RAID-II: A Scalable Storage Architecture for High-Bandwidth Network File Service. Technical Report UCB/CSD 92/672, University of California at Berkeley, February 1992.
[Patterson88] David A. Patterson, Garth Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In International Conference on Management of Data (SIGMOD), pages 109-116, June 1988.
[Welch90] Brent B. Welch. Naming, State Management, and User-Level Extensions in the Sprite Distributed File System. University of California at Berkeley, UCB/CSD 90/567, April 1990.