Comparing multimedia storage architectures by Gennart, B. A. & Hersch, R. D.
Comparing Multimedia Storage Architectures
Benoit A Gennart Roger D Hersch
Swiss Federal Institute of Technology EPFL  Lausanne Switzerland
fgennartherschgdiepch
Abstract
Multimedia interfaces increase the need for large image databases
capable of storing and reading streams of data with strict
synchronicity and isochronicity requirements. In order to fulfill these
requirements, we use a parallel image server architecture which relies
on arrays of intelligent disk nodes, each disk node being composed of
one processor and one or more disks. This contribution analyzes through
simulation the real-time behavior of two multiprocessor multi-disk
architectures: the GigaView and the Unix workstation cluster. The
GigaView incorporates point-to-point communication between processing
units, and the workstation cluster supports communication through a
shared bus-and-memory architecture. For a standard multimedia server
architecture consisting of  disks and  disk-node processors, we
evaluate stream frame access times under various parameters such as
load factors, frame size, stream throughput and synchronicity
requirements. We compare the behavior of the GigaView and the
workstation cluster in terms of delay and delay jitter.
1 Introduction
A highperformance highcapacity image server must
provide users located on local or public networks with
a set of adequate services for immediate access to im
age video and sound streams stored on disk arrays
The RAID concept  oers very high bandwidth disk
arrays hooked directly onto highspeed networks The
multiprocessor multidisk MPMD	 approach we use
associates disks and processors so as to form an array
of intelligent disk nodes capable of applying in paral
lel local preprocessing operations before sending data
from the disks to the client workstation We have
shown that such preprocessing operations are highly
valuable in the case of image accesses 
 large pixmap
images can be reduced into displayable size images
at disk reading speed  Multimedia applications
where bandwidth must be carefully controlled benet
from such preprocessing capabilities In the MPMD
approach pixmap image data is partitioned into rect
angular extents each extent having a size which min
imizes global access time In order to ensure high
throughput contiguous image extents are allocated
on dierent disk nodes The MultiDimensional File
System MDFS	 developed at EPFL  handles data
partitioning and allocation on multiple disk nodes
The authors have implemented an MPMD image server called the GigaView.
A disk T transputer-based architecture, connected through a SCSI
standard interface to a host computer (Macintosh, Unix workstation),
provides a throughput of up to MBytes/s and the ability to browse
through images and maps of arbitrary size at the rate of three to four
by byte/pixel visualization windows per second. Future implementations
of the GigaView will rely on the faster T9000 transputer, which can
support up to  disks hooked in parallel and sustain a throughput of
approximately MBytes/s.
This contribution analyzes through simulation the real-time behavior of
the GigaView in terms of throughput and delay jitter. It compares the
performance of the GigaView to the performance of a general-purpose
multiprocessor multi-SCSI-channel high-end UNIX workstation cluster.
For high-end image server architectures consisting of  disks and 
processors, we evaluate stream frame access times under various
parameters such as load factors, frame size, stream throughput and
synchronicity requirements. In this contribution, we consider reading
multimedia streams stored on disk without any processing operation such
as compression, decompression or resampling. This allows us to
highlight the overhead due to data transfers within the image server
architectures. Future contributions will take into account operations
that can be executed in parallel.
Our approach is to evaluate individual component performance through
experiments on single-processor single-disk workstations (e.g. local
processor memory to global memory bandwidth, local processor memory
bandwidth, disk throughput and latency), and to use the measured
performance parameters to build simulation models of MPMD
architectures. We then evaluate through simulation the performance of
the modeled MPMD architectures.
The result of the analysis is that, despite lower individual component
performance, the point-to-point communication scheme supports higher
throughputs at the application level and scales better to higher
performance architectures.
Section 2 describes the MultiDimensional File System (MDFS), the
GigaView multiprocessor multi-disk architecture and the architecture of
a multiprocessor multi-SCSI-channel workstation cluster. Section 3
discusses the models used to simulate the architectures, as well as the
methodology used to evaluate each architecture's parameters. Section 4
compares the throughputs of both architectures. Section 5 analyzes the
behavior of the GigaView and workstation cluster when used as
multimedia servers.
2 GigaView and workstation cluster
In this section, we describe the hardware and software architecture of
both the GigaView parallel image server and a generic workstation
cluster architecture (sections 2.1 and 2.2). When statements apply to
both the GigaView and workstation cluster architectures, we refer to
them under the name: the parallel image server. We introduce the
concepts underlying the MultiDimensional File System, which specially
supports imaging applications (section 2.3).
2.1 GigaView architecture
The GigaView consists of a server interface processor connected through
communication links to an array of intelligent disk nodes (Figure 1).
The server interface processor provides the network interface. Each
disk node consists of one or more standard disks connected through a
SCSI-II bus to a local disk node processor. The local processors are
transputers (T in the current version, and T9000 when they become
available). They provide both processing power and communication links.
The number of links per transputer is four. Data transfers through the
links and data processing by the transputer do not interfere: data
packets transferred through links are written by DMA (direct memory
access) into the processor's memory. The disk nodes support disk
access, extent caching, image part extraction and image
(de)compression. Since the transputer supports context switches in
hardware, context switches can be executed in a few microseconds and
therefore do not add any noticeable overhead to the main computations.
Figure 1: GigaView disk architecture (server interface processor with a
standard network interface: SCSI-II, ATM, FDDI; links and optional
crossbar switch; disk nodes, each with a local processor and standard
SCSI-2 disks)
2.2 Workstation cluster architecture
The workstation cluster architecture (figure 2) consists of a single
high-speed backplane bus connected to processors, SCSI channels and
main memory. The SCSI channels connect secondary storage devices
(typically magnetic disks) to the backplane bus. We assume that it is
possible to transfer data directly from secondary storage to main
memory by DMA.
 MultiDimensional File System
In order to access disks in parallel, images are partitioned into
rectangular extents. The MultiDimensional File System (MDFS) stores
images of several dimensionalities, each divided into extents of the
same dimensionality, and provides excellent access performance
regardless of the size of the accessed file and of the architecture on
which it is executed. Image access performance is heavily influenced by
how extents are distributed onto a disk array. In a previous
publication, we have shown that the extent size should be between  and 
KBytes, and described algorithms to allocate extents efficiently on a
disk array.
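The partitioning of a visualization window into extent requests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the extent dimensions, the function names (window_to_extents, disk_of, requests_per_disk) and the staggered modulo allocation are hypothetical and are not the MDFS allocation algorithms referred to above.

```python
from collections import defaultdict


def window_to_extents(x, y, width, height, ext_w, ext_h):
    """Return the (col, row) indices of every extent overlapped by a window.

    The window is given in pixels; ext_w and ext_h are the extent dimensions
    in pixels (assumed values, chosen by the caller).
    """
    first_col, last_col = x // ext_w, (x + width - 1) // ext_w
    first_row, last_row = y // ext_h, (y + height - 1) // ext_h
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


def disk_of(col, row, n_disks, row_offset=1):
    """Hypothetical allocation rule: shift each extent row by row_offset so
    that horizontally and vertically adjacent extents map to different disks
    (for n_disks > 1)."""
    return (col + row * row_offset) % n_disks


def requests_per_disk(extents, n_disks):
    """Group extent requests by the disk that holds them."""
    load = defaultdict(list)
    for col, row in extents:
        load[disk_of(col, row, n_disks)].append((col, row))
    return dict(load)


if __name__ == "__main__":
    extents = window_to_extents(100, 100, 512, 512, 256, 256)
    for disk, reqs in sorted(requests_per_disk(extents, 4).items()):
        print(disk, reqs)
```

The more evenly the requests of one window are spread over the disks, the closer the window access time gets to the time of a single extent access; this is why the distribution of extents matters for access performance.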
Figure 2: Workstation cluster architecture (general-purpose workstation
cluster: processors with local memory holding the large local buffers,
global memory holding the shared buffers, and SCSI-channel controllers,
all connected to a high-speed backplane bus and to the network; each
SCSI node holds standard SCSI-2 disks)
3 Architecture Modeling
This section describes the methodology used to model the architectures.
Individual components such as memory, disks, busses and processors are
measured experimentally, and relevant parameters such as throughput and
latency are evaluated. Simulation models for individual component
operations (e.g. the time to transfer a data packet from disk to shared
memory) are created using the measured parameters. A system is modeled
as a set of individual components. Operations on a system (e.g. the
GigaView reading a visualization window) are specified as a series of
individual component operations. Measured systems are actual systems,
such as a disk GigaView or a single-disk single-processor workstation.
Simulated systems are prospective systems, such as a disk-node disk
GigaView or a processor disk workstation cluster. The simulator derives
the system performance for specific stimuli. The benefit of this
approach is the ability to evaluate accurately architectures consisting
of many processors and disks having varying individual component
performance. It allows asking questions such as: how does the processor
performance affect the overall system performance? What is the
architecture bottleneck? What is required from a specific individual
component to reduce the bottleneck?
Section 3.1 describes a methodology to measure the actual performance
of a system's individual components. Section 3.2 specifies the GigaView
and workstation cluster simulation models.
3.1 Evaluating individual components
Experience shows that for multimedia applications (consisting
essentially of data transfers), all individual components (shared
memory, local memory, transputer links) exhibit a linear behavior. That
is, their delay depends linearly on the data set size. Therefore two
parameters, latency and throughput, are sufficient to model their
behavior, using the formula
$$\mathit{Delay} = \mathit{Latency} + \frac{\mathit{DataSetSize}}{\mathit{Throughput}}$$
To evaluate the throughput and latency of a given operation, we plot
its delay as a function of the data set size and linearize
(least-squares fit). The inverse of the slope of the fitted line gives
the throughput. The intersection with the vertical axis
(DataSetSize = 0) gives a measure of the latency.
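As a minimal sketch of this fitting step, the Python function below performs an ordinary least-squares fit of delay against data set size and returns the latency (intercept) and the throughput (inverse of the slope). The function name fit_latency_throughput and the sample values in the usage lines are hypothetical; they are not measurements from this contribution.

```python
def fit_latency_throughput(sizes, delays):
    """Least-squares fit of Delay = Latency + DataSetSize / Throughput.

    sizes  -- data set sizes (e.g. in bytes)
    delays -- measured delays for those sizes (e.g. in seconds)
    Returns (latency, throughput) in the corresponding units.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(delays) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, delays))
    slope = sxy / sxx                    # seconds per byte
    latency = mean_y - slope * mean_x    # intercept at DataSetSize = 0
    return latency, 1.0 / slope          # throughput is the inverse slope


if __name__ == "__main__":
    # Synthetic, made-up measurements: 20 ms latency, 2 MBytes/s throughput.
    sizes = [64e3, 128e3, 256e3, 512e3]
    delays = [0.02 + s / 2.0e6 for s in sizes]
    print(fit_latency_throughput(sizes, delays))
```

With real measurements the points do not lie exactly on a line, and the residuals of the fit indicate how well the two-parameter model describes the component.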
Denitions We consider two software concepts 

process and buer  and two hardware concepts 
 pro
cessor and memory In the following discussion the
word global applies to memory accessible by all pro
cessors in an architecture  the word shared applies to
a buer visible by all processes in a program  and the
word local is applied to the memory resp a buer	
visible by a single processor resp process	 The as
sumptions are that 	 a small local buer ts in the
local processor memory  	 a large local buer ex
ceeds the local memory size and is therefore stored
in global memory  	 a shared buer is always in
global memory These assumptions aim at producing
a simple model of the general memory access behav
ior of a workstation cluster where the hierarchy of
caches of a processor is modeled as local memory and
global memory operations set copy	 are modeled as a
number of backplane bus transfers The test programs
enable us to conrm or invalidate these assumptions
and model the number of backplane bus transfers re
quired by a given global memory operation
Goal In the multimedia application considered for
this contribution all operations consist of data trans
fers 
 reading from disks  transferring data through
the backplane bus to and from main memory  trans
ferring data through transputer links  copying data in
local memory In previous contributions we have mea
sured disk transfers and transputer link transfer rates
The disks are rated at ms latency and MBytessec
The T links are rated at MBytessec and s la
tency Our purpose is to measure the backplane bus
throughput and the local memory throughput of work
station clusters
To evaluate these two parameters, the authors wrote seven test
functions and deduced the performance parameters from the delay
measures so obtained. The seven functions are:
(a) ISB: initialize a shared buffer using the UNIX memset function;
(b) IPCISB: initialize a shared buffer (memcpy) allocated using the IPC
mechanism;
(c) ISLB: initialize small local buffer (memset);
(d) ILLB: initialize large local buffer (memset);
(e) CSLTSB: copy small local buffer to shared buffer (memcpy);
(f) CLLTSB: copy large local buffer to shared buffer (memcpy);
(g) SSLTSB: shuffle small local buffer to shared buffer (memcpy).
Our assumption is that a memset (resp. memcpy) operation requires one
(resp. two) bus transfers. The typical small buffer size is KBytes to
KBytes, small enough that no data is transferred onto the backplane
bus. The size of a large buffer is  to MBytes. Test functions are
called repeatedly, so that the typical experiment lasts about  sec. Our
assumption is that a large-buffer memset operation corresponds to one
backplane-bus data transfer and that a large-buffer memcpy corresponds
to two backplane-bus data transfers.
To evaluate the performance parameters of various architectures, we ran
the seven test functions on single-processor workstations. Table 1
summarizes the results for four UNIX platforms: SparcLX station (SLX),
SparcServer  (S), Silicon Iris (Iris) and DecStation  (DEC). These are
affordable workstations, with a price in the K range. For reference, we
give the performance results of the Silicon Challenge (Chall) with one
processor.
Table 1 shows the performance for a single-processor workstation
running a single process. The numbers in the table represent data
transfer rates between different parts of the architecture (MBytes/s).
The numbers in the table are accurate within . That is, when
reproducing the experiment, we get a deviation in throughput numbers
that stays within  of the values displayed in the table.
Assuming that the ILLB (initialize large local buffer) routine measures
the backplane bus throughput, and that the ISLB (initialize small local
buffer) routine measures the local memory throughput, table 1 suggests
that the Sparc station LX and the Sparc Server have approximately the
same bus performance ( and MBytes/s), and that the Sparc Server  has
faster local memory throughput. The Silicon Iris has a faster bus and
higher local memory throughput than both Sparc architectures. The
DecStation has outstanding local memory and bus throughputs.
arch SLX S Iris Dec
Chall
 proc
ISB     
IPCISB     
ISLB     
ILLB     
CLLTSB     
CSLTSB     
SSLTSB     
Table  
 Workstation cluster throughput MBs	
single processor single process	
Comparing the ILLB (initialize large local buffer, modeled as one
backplane bus transfer) and the CLLTSB (copy large local buffer to
shared buffer, modeled as two backplane bus transfers) routines, we
notice that the CLLTSB throughput is indeed roughly half the ILLB
throughput, except for the Silicon Iris. This suggests that the
one-backplane-transfer and two-backplane-transfer assumptions are valid
for the Sparc and Dec architectures. In the Iris architecture, a DMA
mechanism may provide direct memory-to-memory transfers. The SSLTSB
test function (shuffle small local buffer to shared buffer) shows that
the cost of transferring the data in small packets is high: compared
with the single-packet transfer rate (copy small local buffer to shared
buffer), the throughput is divided by at least a factor of . Comparing
the ISB and IPCISB functions, we notice that the overhead due to the
IPC mechanism is at least a  factor. The authors are aware that there
are other mechanisms than IPC to share memory between processes, but
the fact remains that there is always an overhead for shared memory
access.
For our simulations, we assume that the backplane bus throughput (resp.
local memory bandwidth) is equal to the ISB (resp. CLLTSB) function
throughput. We round up the numbers of table 1 and make use of the
numbers of table 2. Since at the time of publication the T9000 was not
yet available, we assume its performance to be  times the T
performance.
arch T S Iris Dec
backplane   
memory    
Table  
 Workstation cluster throughput MBs	
3.2 Simulation models
Using the parameters measured on single-processor workstations in
section 3.1, we specify models of two multiprocessor multi-disk
architectures: the GigaView architecture, using point-to-point
communication between processors and disk nodes, and the workstation
architecture, using a shared-memory-and-bus architecture for
communication.
Reading a visualization window from the GigaView consists of
decomposing a window request into extent requests. As soon as an extent
request is generated by the interface processor, it is transferred down
the appropriate transputer link to the disk node where the extent is
located. The disk node reads the extent from the disk into its
processing unit memory. The extent is then transferred up a transputer
link back to the interface processor, where it is merged with the other
extents to form the visualization window. For all experiments, the
GigaView model consists of T9000 transputers (local memory throughput
of MBytes/s, link throughput of MBytes/s).
Reading a visualization window from the workstation cluster consists of
decomposing a window request into extent requests. The decomposition is
carried out by one of the workstation cluster processors. As soon as an
extent request is generated by the processor, it is transferred down
the backplane bus to the SCSI node where the extent is located. The
extent is read from the disk and transferred by direct memory access to
global memory. The processor then merges the extent scanline by
scanline into the visualization window located in global memory. This
last operation requires two additional transfers on the backplane bus.
The last bus transfer suffers from two overheads: access to shared
memory and small-packet transfer. We assume the small-packet transfer
overhead to be a factor of  (the ratio between the throughputs of the
CLLTSB and SSLTSB functions) and the overhead of shared memory access
to be an additional factor of . The workstation cluster model is based
on the DEC performance parameters (backplane bus throughput of
MBytes/s, local memory throughput of MBytes/s).
The same disks are used in both architectures. The disks are GByte IBM
disks rated at msec seek time and MBytes/s throughput. The following
section analyzes the GigaView and workstation cluster behavior when
used as multimedia servers.
4 Architecture throughput
Simulations show that it is possible to describe the behavior of a
parallel storage server using two numbers: latency and throughput.
Figure 3 shows how the throughput evolves as disks are added to each
architecture. Both architectures have  disk nodes. Disk nodes consist
of one processor with    or  disks.
Figure 3: GigaView vs. workstation throughput (delay in seconds as a
function of the visualization window size in MBytes). GigaView: 4-disk
throughput = 8.316 MBytes/s, 8-disk = 16.36 MBytes/s, 12-disk = 23.62
MBytes/s, 16-disk = 30.37 MBytes/s. Workstation (dotted lines): 4-disk
throughput = 8.32 MBytes/s, 8 or more disks = 15.5 MBytes/s.
For the Tbased architecture the disks are the
busiest components for up to  disks in the architec
ture For a disk architecture the disk resp links
local processor interface processor	 utilization for a
MBytes visualization window request is  resp
  	 Above  disks the server interface
processor is more utilized than the disks In the case
of the workstation architecture the curves are super
imposed for all architectures with more than  disks
gure  dotted lines	 This indicates that the per
formance is limited not by the disk throughput but by
another component Analysis of the utilization data
indicates that the backplane bus is indeed the bottle
neck With a bus rated at MBytess the application
throughput is limited at MBytess or a fth of the
backplane bus throughput
5 Multimedia servers
This section studies both the GigaView and the workstation cluster in
terms of delay and delay jitter when their load consists of one or more
multimedia streams. The purpose of the analysis is to ensure that it is
possible to make both architectures a source node in a real-time
channel [4]. In other words, assuming that a channel originating from
or terminating at the parallel image server has been requested and
established, we try to establish whether a parallel architecture can
guarantee a bounded delay for each frame in the channel.
Section 5.1 describes the experimental setup. Section 5.2 analyzes the
image server's behavior for a single user reading frames allocated on
multiple disks. Section 5.3 analyzes the image server's behavior for
multiple users reading frames allocated on multiple disks.
5.1 Experimental setup
During an experiment, the parallel image server supplies one or more
streams, each defined by a request pattern. By default, a request
pattern spans one second and consists of several individual frame
requests distributed over the one-second interval. To test the behavior
of the parallel image servers under various loads, the request pattern
is scaled using a factor called the time-slice. The one-second
time-slice corresponds exactly to the request pattern described at the
beginning of each experiment report. Experiments show that the
utilization varies linearly with the inverse of the time-slice
duration. Each experiment consists of simulating the disk architecture
for approximately  time-slices. A histogram of frame delays is gathered
for each stream supplied by the parallel image server and scaled so as
to represent a probability distribution.
All experiments consist of reading (as opposed to writing) streams. The
user requests a stream from the image server, and the image server
schedules each frame request. There is no jitter in the time of each
frame request, since the frame requests are generated internally. The
results are presented in terms of delay probability distribution (pd)
and delay cumulative probability distribution (cpd). In figures where
both the delay probability distribution and the delay cumulative
probability distribution are shown, only the cumulative probability
distribution scale (cpd, going from 0 to 1) is shown on the y-axis.
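The scaling of a delay histogram into a probability distribution and its cumulative counterpart can be sketched as follows. This Python fragment is a minimal illustration; the function name delay_distributions, the bin width and the sample delays in the usage lines are hypothetical and are not simulation results.

```python
def delay_distributions(delays, bin_width):
    """Build the delay probability distribution (pd) and the cumulative
    probability distribution (cpd) from a list of frame delays.

    delays and bin_width are integers in the same unit (e.g. milliseconds).
    Returns two lists of (bin_start, probability) pairs, covering only the
    bins that contain at least one delay.
    """
    counts = {}
    for delay in delays:
        bin_index = delay // bin_width
        counts[bin_index] = counts.get(bin_index, 0) + 1

    total = len(delays)
    # Scale the histogram so that the probabilities sum to one.
    pd = [(b * bin_width, counts[b] / total) for b in sorted(counts)]

    cpd, running = [], 0.0
    for bin_start, probability in pd:
        running += probability
        cpd.append((bin_start, running))
    return pd, cpd


if __name__ == "__main__":
    sample_delays_ms = [42, 44, 47, 48, 51, 58]  # made-up values
    pd, cpd = delay_distributions(sample_delays_ms, 5)
    for (start, p), (_, c) in zip(pd, cpd):
        print(f"{start:3d} ms  pd={p:.3f}  cpd={c:.3f}")
```

The delay jitter discussed in the following sections corresponds to the spread of the pd, or equivalently to how steeply the cpd rises from 0 to 1.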
In this set of experiments, both architectures consist of  storage
nodes, each storage node including two disks. We compare a T9000-based
GigaView architecture and a DEC-based workstation cluster architecture.
The experiments reported in this section describe the behavior of the
image servers in uncompressed full-frame access mode. The full-frame
access mode consists of accessing all extents making up an image stored
on the GigaView. This is the usual access mode for multimedia streams.
Frames in a stream are KBytes in size. For reference, a studio-quality
TV single-frame image consists of by byte pixels, or KBytes. Each frame
is segmented into  extents distributed on all disks of the
architecture. We show results for single and multiple users requesting
streams of frames distributed over multiple disks.
5.2 Single user
In this experiment, the image server supplies one stream. The
one-second time-slice request pattern consists of  uniformly
distributed frame requests. The following two paragraphs compare the
GigaView and workstation cluster architectures for a 32 frames/s load,
corresponding to a ms time-slice.
Access delays. For the GigaView (resp. workstation cluster), 32
frames/s corresponds to a 70% (resp. 85%) utilization. The shaded areas
in figure 4 represent the probability distribution, and the continuous
lines the cumulative probability distribution (cpd).
Figure 4: Single-stream access-time distribution (cpd vs. delay in
seconds). GigaView (T9000): 32 frames/s, 70% utilization, mean = 47.9
ms, sdev = 4.85 ms. Workstation (DEC): 32 frames/s, 85% utilization,
mean = 48.7 ms, sdev = 3.32 ms.
The two architectures have similar delays: the workstation cluster's
fast processor makes up for its relatively slow bus. The comparison of
the GigaView and the workstation cluster yields a rather unintuitive
result: the two architectures have nearly the same delay, but the
workstation cluster has the smaller delay jitter. To any user of a
workstation with unpredictable response time, this comes as a surprise.
The explanation comes from the fact that the workstation cluster bus is
a bottleneck. All bus requests are therefore delayed, which hides the
jitter due to the disks.
Delay distribution. Figure 5 presents cumulative probability
distributions (cpd) of access delays for utilizations ranging from  to 
. Each curve on the figure represents the cumulative probability
distribution for a given utilization. For throughputs up to  frames/s,
all cpd curves are similar.
Figure 5: cpd vs. delay and utilization for a single stream (cpd vs.
delay in seconds, one curve per frame rate). GigaView (T9000): <20, 32,
34, 36, 38 and 40 frames/s. Workstation (DEC): <20, 32, 34 and 36
frames/s.
The workstation architecture has a small delay jitter, but is unable to
sustain throughputs above  frames/s (MBytes/s). Above  frames/s, the
GigaView architecture is slowed down by the memory throughput of its
server interface processor. Replacing the T9000 by the faster alpha
processor would allow the GigaView architecture to sustain throughputs
of up to frames/s (MBytes/s).
5.3 Multiple users
In this experiment, the image server supplies three streams. The
one-second time-slice request pattern of stream one (respectively two
and three) consists of  (respectively  and ) uniformly distributed
frame requests. The one-second time-slice utilization is  (resp. ) for
the GigaView (resp. workstation cluster). To simulate the worst case,
the three request patterns start at exactly the same time, which causes
the occurrence of three simultaneous requests for every time-slice.
Figure 6 shows the access delay distribution of the three combined
streams for both architectures. The throughput is 27 frames/s, i.e.
MBytes/s, for a time-slice of ms. In this experiment, the GigaView
(resp. workstation cluster) utilization is 60% (resp. 70%). Stream
interactions more than double the maximum access delay, bringing it to
ms, compared to the single-stream access-delay performance of ms.
The multipleuser analysis suggests that streams
with dierent frame rates strongly aect the delay jit
ter If absolute delay is of importance and buer
ing is not an alternative it is worthwhile considering
whether to constrain frame rates on a parallel image
server shared between multiple users to a basic frame
rate or an integer fraction of it
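The effect of constraining frame rates can be illustrated by looking at how closely the requests of several streams fall together within one time-slice. The Python sketch below is only an illustration under stated assumptions: the function names (request_times, min_gap), the frame counts and the start offsets are hypothetical, and the spacing between requests is used as a rough indicator of contention rather than as a delay model.

```python
from fractions import Fraction


def request_times(frames_per_slice, offset=Fraction(0)):
    """Uniformly distributed request times of one stream within a
    time-slice of unit length, shifted by an optional start offset."""
    return [offset + Fraction(i, frames_per_slice)
            for i in range(frames_per_slice)]


def min_gap(streams):
    """Smallest spacing between consecutive requests once all streams are
    merged; the pattern repeats every time-slice, so the gap wraps around.
    A gap of 0 means that at least two requests are simultaneous."""
    times = sorted(t % 1 for stream in streams for t in stream)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    gaps.append(times[0] + 1 - times[-1])
    return min(gaps)


if __name__ == "__main__":
    # Unrelated frame rates, all starting together (worst case).
    unrelated = [request_times(5), request_times(3), request_times(2)]
    print(min_gap(unrelated))

    # A basic rate of 4 frames per slice and integer fractions of it,
    # with staggered start offsets.
    related = [request_times(4),
               request_times(2, Fraction(1, 8)),
               request_times(1, Fraction(3, 8))]
    print(min_gap(related))
```

With rates that are integer fractions of a basic rate, a fixed set of start offsets keeps all requests apart in every time-slice, whereas unrelated rates cannot be kept apart by a single choice of offsets in general.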
Figure 6: Delay distribution for multiple streams (cpd vs. delay in
seconds). GigaView: 27 frames/s, 60% utilization, mean = 52.0 ms, sdev
= 13.9 ms. Workstation: 27 frames/s, 70% utilization, mean = 55.6 ms,
sdev = 14.5 ms.
6 Conclusion
This contribution compares the image and multimedia performance
behavior of a shared-memory-and-bus-based multiprocessor multi-disk
(MPMD) workstation cluster with that of an MPMD architecture having
processor-to-processor communication channels (GigaView) instead of
global memory or buses. Image window visualization requires reading
image extents from disks to the processor's local memory, sending them
to the server interface processor and merging them into a single
visualization window. Since in the GigaView architecture local disk
node processors independently read extents from their disks, no shared
resources are required for these operations. The only resource where
processing needs to be carried out sequentially is the image part
merging process running on the server interface processor. With the
workstation cluster architecture, however, shared resources are used
for nearly every operation: reading from the disks requires copying
blocks from the I/O channel to global memory and from there to the
processor caches. Image extents need to be transferred through the
shared bus to global memory, where they are merged into the desired
visualization window. Experimentation and simulations show that the
shared bus is the workstation cluster server's bottleneck. In order to
achieve a given throughput at the user level, five times that
throughput is necessary at the shared bus level. The GigaView
architecture, however, needs to sustain only a fraction of the user
throughput at the level of the disk node processors. At the server
interface processor level, a local memory throughput three times as
large as the user-level throughput is sufficient in order to receive
extents and merge them into a single visualization window. We can
therefore conclude that workstation cluster architectures do not
perform well for pixmap image access tasks, and that the same
performance can be obtained at a much lower price with a GigaView
architecture based on point-to-point communication between processors.
Regarding the multimedia performance of both architectures, a dedicated
workstation cluster architecture having a MBytes/s bus throughput and
serving only a single set of requests at a time (no task switches)
offers, due to the balancing effect of its shared bus, a lower delay
jitter than the GigaView architecture. Nevertheless, the total access
delay is slightly longer and its utilization rate higher than the
corresponding parameters of the GigaView architecture.
In the case of multiple streams having different non-commensurable
access rates, stream interactions more than double the maximum access
delay and are responsible for a delay jitter which is much longer than
the mean access delay. Contentions between streams may be reduced by
introducing a single basic frame rate for all stream requests and
possibly integer sub-frame rates for lower throughput streams. By
appropriately sequencing such multiple frame requests, one would obtain
delay jitters close to those of single frame requests.
References
[1] Ann L. Drapeau et al. RAID-II: A high-bandwidth network file
server. In Proc. Int. Symp. Computer Architecture, Chicago, Illinois.
[2] R. D. Hersch, B. Krummenacher, and L. Landron. Parallel pixmap
image storage and retrieval. In Grebe et al., editor, Proceedings of
the World Transputer Congress. IOS Press.
[3] R. D. Hersch. Parallel storage and retrieval of pixmap images. In
Proceedings of the IEEE Symposium on Mass Storage Systems, Monterey.
[4] D. Ferrari and D. C. Verma. A scheme for real-time channel
establishment in wide-area networks. IEEE Journal on Selected Areas in
Communications, April.
