Spatial and temporal data parallelization of the H.261 video coding algorithm by Yung, NHC & Leung, KK
Title Spatial and temporal data parallelization of the H.261 video coding algorithm
Author(s) Yung, NHC; Leung, KK
Citation IEEE Transactions on Circuits and Systems for Video Technology, 2001, v. 11 n. 1, p. 91-104
Issued Date 2001
URL http://hdl.handle.net/10722/42871
Rights
©2001 IEEE. Personal use of this material is permitted. However,
permission to reprint/republish this material for advertising or
promotional purposes or for creating new collective works for
resale or redistribution to servers or lists, or to reuse any
copyrighted component of this work in other works must be
obtained from the IEEE.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 1, JANUARY 2001 91
Spatial and Temporal Data Parallelization of the
H.261 Video Coding Algorithm
Nelson H. C. Yung, Senior Member, IEEE, and Kwong-Keung Leung, Student Member, IEEE
Abstract—In this paper, the parallelization of the H.261 video
coding algorithm on the IBM SP2® multiprocessor system is
described. The effects of parallelizing computations and
communications in the spatial, temporal, and combined spatial-temporal
domains are considered through the study of frame rate, speedup,
and implementation efficiency, which are modeled and measured
with respect to the number of nodes (n) and parallel methods
used. Four parallel algorithms were developed, of which the first
two exploited the spatial parallelism in each frame, and the last
two exploited both the temporal and spatial parallelism over a
sequence of frames. The two spatial algorithms differ in that one
utilizes a single communication master, while the other attempts
to distribute communications across three masters. On the other
hand, the spatial-temporal algorithms use a pipeline structure for
exploiting the temporal parallelism together with either a single
master or multiple masters. The best median speedup (frame rate)
achieved was close to 15 [15 frames per second (fps)] for 352 × 240
video on 24 nodes, and 13 (37 fps) for QCIF video, by the spatial
algorithm with distributed communications. For n < 10, the
single-master spatial algorithm performs better with efficiency up
to 90%, while the multiple-master spatial algorithm is superior
for n > 10, with efficiency up to 70%. The spatial-temporal
algorithms achieved average speedup performance, but are most
scalable for large n.
Index Terms—Domain decomposition, spatial parallelism,
speedup, temporal parallelism, video coding.
I. INTRODUCTION
ACTIVITIES in video coding can be traced back over 30 years, to when analog techniques were primarily used
for reducing video transmission bandwidth. In today’s terms,
the ways in which the video sequence is coded still determine
much of the cost for storage and transmission. Among the va-
riety of coding techniques and methods developed so far, inter-
national standards such as the H.261 [1], H.263 [2], MPEG-1
[3], MPEG-2 [4], and MPEG-4 [5] have given a clear direc-
tion to encoder/decoder implementation. They in turn fueled the
rapid expansion of applications into multimedia computing, in-
formation storage, video phone/conferencing, remote sensing,
medical imaging and many other audiovisual services [6]–[8].
As the application coverage of these standards increases,
it becomes apparent that real-time coding (24–30 fps) is
essential. However, due to the complex nature of coding itself,
real-time performance for a standard frame size has not
been reached using current single-processor technology.
Manuscript received November 5, 1997; revised June 1, 2000. This work was
supported by Texas Instruments Tsukuba Research and Development Center,
Japan, and by the University Grants Committee, Hong Kong, Area of Excellence
in Information Technology, under Grant AOE98/99.EG01. This paper was
recommended by Associate Editor R. Stevenson.
The authors are with the Department of Electrical and Electronic Engineering,
the University of Hong Kong, Hong Kong, SAR, China (e-mail:
nyung@eee.hku.hk; kkleung@eee.hku.hk).
Publisher Item Identifier S 1051-8215(01)00671-1.
Although many researchers have opted to consider new and
faster coding techniques [9]–[11], there have been several
attempts at parallel implementation of the current coding
standards. For instance, Sijstermans and Van der Meer im-
plemented an MPEG-1 encoder for CD-I production using a
POOMA with 100 M68020 processors [12]. They achieved a
measured speedup of 32 for an image sequence, or
an equivalent of 0.5 fps. Motion estimation was parallelized
temporally, in which the video sequence was assigned to a
set of nodes, with the rest of the coding process parallelized
spatially by partitioning each frame into subsets of consecutive
blocks. With this scheme, the scalability was dependent on the
allocation of nodes to various coding functions, which was
only determined experimentally. Akramullah, Ahmad, and
Liou achieved real-time performance of MPEG-2 coding on
the Intel Paragon® XP/S using purely spatial partitioning and
CIF format, equivalent to a speedup of 128 on 330 nodes
[13], [14]. The frame data was evenly partitioned by a
two-dimensional grid before mapping to the processors. The distribution
of input frame data was such that the processors were
allocated overlapped frame data so as to reduce inter-processor
communication during motion estimation. But there was no
indication of how the reconstructed data was handled. Also,
in their calculation of coding delay, the communication delay
between processors was not taken into account. For paralleliza-
tion, this overhead cannot be neglected since inter-processor
communication time can be substantial, largely determined by
the number of nodes and the interconnection network used.
Shen, Rowe, and Delp achieved 41-fps MPEG compression
of CIF video sequences using 100 nodes of Intel Touchstone
Delta or 144 nodes of Intel Paragon [15]. They used temporal
parallelism in which groups of pictures (GOPs) were encoded in
parallel, with each node processing a GOP. They highlighted the
fact that data communication accounts for more than half of the
total coding time, and used a custom I/O queue to reduce data
access contention. Although this method achieved real-time
performance, it requires a large number of frames to be ready
before the encoding actually starts; e.g., 100 nodes with a GOP
of 6 frames require a total of 600 frames. Therefore, there is
a delay of tens of seconds between the input and the output bit
stream. Besides, their temporal parallelism is limited to coding
standards with a GOP structure, and so is not applicable to H.261.
Shen and Delp further enhanced their method using spatial
parallelism and as a result achieved 33 fps for a CCIR601
test sequence with 512 nodes [16]. In this case, each GOP is
1051–8215/01$10.00 © 2001 IEEE
processed by a group of processors in which each frame is
spatially partitioned into slices before being mapped onto the
group. For input data, the entire GOP is sent to each slave to
reduce inter-processor data dependency and communication.
Further, the output from each group is saved as separate files
that require offline concatenation. While real-time performance
is achieved, there exists a constant delay ranging from 10–30
seconds as before. Similarly, Agi and Jagannathan implemented
an MPEG-1 encoder on a network of workstations and a CM5
system using temporal parallelization based on GOP [19]. They
used the GLU master/worker architecture in which a master
node handles scheduling of function execution to a number of
worker nodes at run time. The system is fault-tolerant in that
the master can detect the failure of workers or entry of new
workers. A speedup of 7.5 on a 12-node cluster of Sun SPARC
2 was achieved, an equivalent of 3 fps, and 4.5 fps was achieved
on a 16-node CM5.
Adopting a more dedicated approach, Akiyama et al. out-
lined a pipelined structure of dedicated digital signal proces-
sors for different stages of the coding computation [17]. Their
simulation showed that computation time for each stage of the
pipeline is within the limit for real-time encoding, but no im-
plementation was given. Bouville et al. developed a flexible
platform based on an array of TMS320C80 processors, where
dedicated buses were used for video input and output and ded-
icated hardwired block-matching chips were used for motion
estimation [18]. They adopted spatial parallelization in which
each TMS320C80 Multimedia Video Processor (MVP) processes
a region consisting of horizontal strips of macroblocks
in the frame. Within the MVP, the Master Processor (MP) handles
upper-level bit stream generation and rate control, one Parallel
Processor (PP) handles lower-level bit stream generation, and
the other three PPs handle macroblock processing. Performance figures
were calculated by counting clock cycles of various coding
modules; implementation results are yet to be seen.
As observed, most of these implementations exploited spa-
tial or temporal data parallelization experimentally using gen-
eral purpose or dedicated multiprocessors. Speedup figures and
frame rates are usually presented. As much research has recognized,
it is crucial that both spatial- and temporal-domain decomposition
are fully exploited and that the eventual parallel algorithm
matches the parallel architecture [21]–[23]. Therefore, it is the
purpose of this research to investigate the effect of parallelizing
computations and communications in the spatial, temporal, and
both spatial-temporal domains through the study of frame rate,
speedup, and implementation efficiency. The whole investiga-
tion was focused on the H.261 coding standard because it has
a reasonable degree of complexity, data dependency, commu-
nication constraints, and represents the basis of the four stan-
dards mentioned earlier. It was implemented on an IBM SP2®
supercomputer, where a methodical approach has been adopted
to model, predict and measure the execution time, computation
time, communication time, idle time, speedup, frame rate, and
efficiency with respect to the parallelization methods used and
the number of processors (nodes) employed. This investigation
leads to the development of four parallel algorithms. The first
two exploited the spatial data parallelism in each video frame, of
which one utilizes a single communication master, and the other
attempts to distribute communications across three masters. To
test the algorithms, two video formats were used: 352 × 240
and QCIF, as both are commonly tested by other researchers
and supported by the H.261 software encoder. A speedup (frame
rate) of close to 15 (15 fps) was achieved for 352 × 240 video
on 24 nodes, and 13 (37 fps) for QCIF video. The single-master
spatial algorithm performs better for n < 10, with efficiency
ranging from 60% to 90%, while the multiple-master spatial
algorithm is superior for n > 10, with efficiency ranging from 60%
to 70%. The latter two algorithms exploited both the temporal
and spatial data parallelism, with the pure temporal case given
as a special case. They take into account inter-frame data
dependency and exploit temporal parallelism by a pipeline coding
structure, while spatial parallelism is implemented when coding
a frame. A speedup (frame rate) of around 14 (14 fps) was
achieved for 352 × 240 video on 23 nodes, and 12.5 (35.7 fps) for
QCIF. Their implementation efficiencies vary within 50%–60%
for a wide range of n.
The organization of this paper is as follows. Section II out-
lines the sub-functions of the H.261 coding standard, and in-
troduces a serial version and its delays measured on the IBM
SP2® multiprocessor. Section III discusses the parallel method-
ology and preliminary considerations before developing the al-
gorithms. Section IV presents the method used for parallelizing
the encoder spatially. Section V describes the parallelization of
the encoder in both the spatial and temporal domains. Section VI
discusses and analyzes the measured results. This paper is con-
cluded in Section VII.
II. H.261 ENCODING ALGORITHM
A. Functional Blocks
The H.261 encoder is a hybrid of inter-picture prediction and
transform coding for reducing temporal and spatial redundancy
of the video respectively. The functional architecture of the
coding algorithm is depicted in Fig. 1, where the major com-
ponents are the motion estimation/compensation (ME/MC),
discrete cosine transform (DCT), quantization (QUANT),
zigzag arrangement (ZIGZAG), and variable length entropy
coding (VLC). The decoding part basically consists of a
receiving buffer and a VLC decoder, and performs the inverse of the
following: ZIGZAG, QUANT, DCT, and MC. For simplicity,
the coding control unit is not shown.
In H.261, each frame is divided into a number of macroblocks
(MBs) of size 16 × 16 pixels, which is the basic data
unit for motion compensation. An MB is INTER coded if there
is motion compensation; otherwise, it is INTRA coded. For
INTER-coded MBs, the pixels are coded as motion-compensated
residues after subtracting the corresponding pixels in the reference
frame. Since the residues are usually smaller in magnitude than
the pixels themselves, data compression is achieved by coding
the residues and the motion vector. Motion estimation is the
process of searching for the closest MB from the reference
frame within the search area, which is a bounded and enlarged
area with a spatial position offset from the position of the
currently coded MB. After motion compensation, the frame of
pixels or residues is coded to reduce spatial data redundancy.
The frame is divided into blocks of 8 × 8 pixels, aligned with
YUNG AND LEUNG: SPATIAL AND TEMPORAL DATA PARALLELIZATION OF THE H.261 VIDEO CODING ALGORITHM 93
Fig. 1. Functional block diagram of an H.261 encoder.
the MB boundary for transform coding, quantization, zigzag
traversal, and variable-length coding. After encoding, the
frame is decoded and referenced by the next frame for motion
estimation and compensation.
B. Serial H.261 Encoding Performance
The IBM SP2® system employs a purely distributed memory
architecture in which inter-processor communication is per-
formed via either the Ethernet or the High Performance Switch
(HPS) through message passing [24]–[26]. The platform used
for this investigation is a 32-node system at the University of
Hong Kong. Of the 32 nodes, 24 can be used exclusively by an
application within a limited time window. Each node consists
of a 66.7 MHz POWER2® RISC processor with 64 MB of main
memory and 2 GB of disk storage, providing a peak performance
of 266 MFLOPS. The HPS is a bi-directional multistage inter-
connection network that allows simultaneous message passing
between different pairs of nodes. The measured inter-processor
communication bandwidth and latency are 31 MB/s and 14 ms,
respectively, using the User Space (US) protocol, on the AIX 4.2 operating
system. The tools available for parallelization are the parallel
operating environment, MPL, PVM, MPI, C/C++ compilers,
Fortran and HPF. A LoadLeveler is also available for resource
allocation and queuing of applications [27].
The software H.261 video encoder used is the PVRG-P64
Codec from the Portable Video Research Group at Stanford
[28]. This codec accepts several image formats, such as CIF
(352 × 288), QCIF (176 × 144), and 352 × 240. A number
of parameters can be altered, including frame rate, range of the
motion estimation search window, frame skip index, and
quantization value. Rate control can be performed at the frame level,
in which the frame skip index (by default, no frame skip)
can be varied to drop frames on demand. For motion estimation,
the full search scheme is used in the search window. The max-
imum search window range is pixels, while the default is
Fig. 2. Percentage encoding time break down.
. For each MB in the search area, the Sum-Absolute-Difference
(SAD) is used for similarity comparison. The encoder speeds
up the SAD evaluation by terminating early: if the partial sum of
absolute differences exceeds the minimum SAD found so far,
the search position is skipped without evaluating the
complete SAD. Thus the time taken for motion estimation
depends on the data content (motion).
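The early-termination rule can be illustrated with a minimal Python sketch. It is not the PVRG-P64 code: the function names, the per-row check of the partial sum, and the handling of candidates outside the frame are assumptions made for illustration only.

```python
def sad_early_exit(cur, ref, best_so_far):
    """Sum of absolute differences between two equal-sized blocks.

    Returns None as soon as the partial sum exceeds best_so_far,
    so the candidate position is skipped without a complete SAD.
    """
    total = 0
    for cur_row, ref_row in zip(cur, ref):
        for a, b in zip(cur_row, ref_row):
            total += abs(a - b)
        if total > best_so_far:  # checked once per row (an assumption)
            return None
    return total


def full_search(cur_block, ref_frame, top, left, search_range):
    """Full search in a +/- search_range window around (top, left).

    Returns (best_sad, (dy, dx)) for the best matching position.
    """
    size = len(cur_block)
    height, width = len(ref_frame), len(ref_frame[0])
    best, best_mv = float("inf"), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate falls outside the reference frame
            cand = [row[x:x + size] for row in ref_frame[y:y + size]]
            sad = sad_early_exit(cur_block, cand, best)
            if sad is not None and sad < best:
                best, best_mv = sad, (dy, dx)
    return best, best_mv
```

Because the number of candidates rejected early depends on how quickly a good match is found, the running time of such a search varies with the picture content, which is the behavior noted above.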
The above serial program was implemented on an SP2 node
and timing figures were taken using gettimeofday functions
enclosing major functional blocks. The program was compiled
with the optimization options “-O3” and “-qarch=pwr2”, and
some of its code was also optimized manually [29]. It was
executed through the LoadLeveler for exclusive use of the node
at run-time. For encoding, the input video was 39 frames of the
“table tennis” sequence in 352 × 240 resolution. Each frame has
three files corresponding to the Y, Cr, and Cb components.
Fig. 2 depicts the median percentage time of each block over
3–5 runs for all 39 frames. As expected, motion estimation
contributes almost 40% of the encoding delay. The DCT and
VLC contribute around 7%. The IDCT and QUANT contribute
around 5%, whereas the rest are 2%–4%. An extra function
shown here is STAT, which calculates the coding error statistics.
The average frame time measured is approximately 0.9882 s.
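The relation between the serial frame time and the figures of merit used in this paper can be sketched as follows. The definitions are the conventional ones (frame rate as the reciprocal of frame time, speedup as the ratio of serial to parallel time, efficiency as speedup per node) and are assumed here to match those used in the measurements; the function names are illustrative.

```python
SERIAL_FRAME_TIME_S = 0.9882  # average serial frame time reported above


def frame_rate(frame_time_s):
    """Frames per second for a given average frame time."""
    return 1.0 / frame_time_s


def speedup(serial_time_s, parallel_time_s):
    """Ratio of serial to parallel frame time."""
    return serial_time_s / parallel_time_s


def efficiency(speedup_value, nodes):
    """Speedup obtained per node."""
    return speedup_value / nodes
```

Under these definitions the serial encoder runs at about 1.01 fps, so a speedup of 15 corresponds to roughly 15 fps, and a speedup of 15 on 24 nodes corresponds to an efficiency of 62.5%.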
III. METHODOLOGY AND PRELIMINARY CONSIDERATIONS
A. Parallelization Methodology
Conceptually, the parallelization methodology adopted in this
research consists of four steps: partitioning, communication,
agglomeration, and mapping [30], [31]. In the first step,
partitioning can be on data (domain decomposition) or on
computation (functional decomposition). The communication step
determines the communication required to coordinate task
execution, defining appropriate channels and specifying the messages
that are to be sent and received. The agglomeration step evaluates
the task and communication structures defined in the first
two steps with respect to performance requirements and the
implementation platform, aiming to reduce communication costs
by increasing granularity while keeping the parallelization
flexible. Lastly, mapping assigns each task to a processor. The
Fig. 3. Inter-frame data dependency.
goal of a mapping scheme is to minimize total execution delay,
where two guiding principles are normally observed: concurrent
tasks are executed on different processors, and tasks that com-
municate frequently are executed on the same processor. In this
research, domain decomposition is chosen, in which the input
frame data or frames are partitioned into a number of units and
are mapped to the processors for computation, while the com-
putations performed by each processor are identical.
B. Data Partitioning Issues
For spatial data partitioning, a unit can be as small as a pixel,
although such fine-grain partitioning introduces a huge amount
of communication. As an MB is an inseparable unit in coding,
it seems logical to consider an MB as the base unit rather than
anything smaller. If a unit is larger than an MB, the degree of data
parallelism is unavoidably reduced. With an MB
treated as the smallest data unit, there is still plenty of room
for parallelization, as most picture formats contain several
hundred MBs. In general, if the MBs are evenly partitioned, each
node is allocated the number of MBs in the frame divided by the
number of processing nodes. However, the
MBs may be partitioned unevenly based on workload balancing
criteria, if so desired [32]. The algorithms presented in this paper
employ even partitioning without workload balancing.
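As an illustration of even partitioning, the following Python sketch splits the MB indices of a frame into contiguous segments whose sizes differ by at most one. The function name and the rule that the first nodes receive the extra MB are assumptions; the exact allocation rule of the implementation is not reproduced here.

```python
def partition_evenly(num_mbs, num_nodes):
    """Split MB indices 0..num_mbs-1 into contiguous, near-equal segments.

    The first (num_mbs % num_nodes) nodes receive one extra MB;
    this tie-breaking rule is an assumption of the sketch.
    """
    base, extra = divmod(num_mbs, num_nodes)
    segments, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        segments.append(range(start, start + size))
        start += size
    return segments
```

For example, a 352 × 240 frame contains 22 × 15 = 330 MBs of 16 × 16 pixels; over 24 nodes this gives 18 segments of 14 MBs and 6 segments of 13 MBs.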
C. Data Dependency Issues
There are two types of data dependencies in MB processing.
First, in the temporal direction, data dependency exists between
successive frames while performing ME, MC and IMC, where
references to the decoded MBs in the search area in the pre-
vious frame are required, as depicted in Fig. 3. Second, there
is data dependency between neighboring MBs in a frame because
MB header fields are coded differentially. This
makes the VLC of MB headers inherently serial, so it has to be
performed sequentially over the MBs in the frame. Apart from
these, all the other steps can be executed independently for each
MB. The algorithms presented here attempt to tackle these de-
pendencies so as to reduce communications.
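The dependency introduced by differential coding, and one way to confine it, can be sketched in Python. The field values are hypothetical and the actual H.261 header syntax (including its prediction reset rules) is not modeled; the point is only that a contiguous segment can be coded on its own once the value immediately preceding it is known.

```python
def diff_encode(values, previous=0):
    """Differentially code a list: each output is a value minus its predecessor."""
    out = []
    for v in values:
        out.append(v - previous)
        previous = v
    return out


def encode_in_segments(values, bounds):
    """Encode contiguous (start, stop) segments independently.

    Each segment needs only the last value of the preceding segment,
    which a worker could obtain by recomputing that single MB itself.
    """
    out = []
    for start, stop in bounds:
        prev = values[start - 1] if start > 0 else 0
        out.extend(diff_encode(values[start:stop], prev))
    return out
```

Coding the segments separately in this way yields the same result as one sequential pass over the whole list, provided the segments follow the order of the bit stream.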
D. Communication Issues
To encode MBs in a node, there are three types of data com-
munication: distribution of input MB, collection/distribution of
decoded MBs, and collection of encoded MBs and statistics.
Fig. 4. S1 algorithm.
Assuming the input data is acquired from a video source or a set of
files, it has to be distributed to the other nodes for coding.
Similarly, the decoded data needs to be collected and redistributed,
while the encoded bitstream and statistics need to be collected
and output. For the S1 algorithm, a node is chosen for doing all
three communications. For the S2 algorithm, these communi-
cations are shared between three masters. In general, inter-pro-
cessor communications are required for all three cases, which
form a substantial overhead affecting the overall execution time
and the scalability of the implementation.
IV. PARALLELIZATION IN THE SPATIAL DOMAIN
A. Spatial Parallel Algorithm-S1
After considering the issues discussed in the preceding section,
one of the parallel algorithms developed is a purely spatial
parallel algorithm (S1), where a master is designated for handling
the communication and ordering of MBs, and the slaves
are responsible for the computation [31]. The data partitioning
scheme is such that the MBs of the frame are arranged in the
same order as in the coded bit stream. The sorted list of MBs is divided
into segments of MBs for allocation to the slaves. In this way, the
slaves are able to code the MB headers with differential coding
without referencing each other, except for the first MB allocated
to each slave. However, this can be overcome by having each
slave compute the referenced MB itself. As depicted in Fig. 4,
Master#1 reads the input frame data, reorganizes the array of
pixels into an ordered array of MBs, then distributes them to the
slaves. Master#1 also broadcasts the last decoded frame to the
slaves, which was collected in the previous iteration. Upon receiving
the decoded frame and the MBs from the current frame, the
slaves perform encoding in parallel. In this case, the master also
codes MBs where for a
Fig. 5. S2 algorithm.
frame consisting of MBs. This number is chosen to keep
Master#1 busy while it is not communicating with the slaves.
Finally, Master#1 collects the statistics, coded bit stream and
the decoded MBs from the slaves. In this algorithm, most of the
computations are carried out in the slaves in parallel, with all the
communications being managed by a master, in which
slaves are assigned
MBs each, while the others get . Although
the number of MBs distributed to each slave is nearly even, the
computing delay for each MB, or group of MBs, differs because of
the difference in motion content in each MB and the serial
encoder used.
B. Spatial Parallel Algorithm-S2
This algorithm aims to spread the serial communications by
having three masters separately responsible for handling the
distribution of input MBs, collection of encoded MBs, and the
collection of statistics, as depicted in Fig. 5. Instead of having a
master handle the decoded MBs, communication is further reduced
by having these data exchanged locally among the slaves
according to the MB allocation. For example, if each slave is
allocated a row of MBs, then it exchanges decoded MBs with
only the two slaves holding the upper and lower neighboring rows.
The MBs in the frame are ordered from top to bottom and left to
right. With this scheme, the VLC of MB headers has substantial
data dependency between MBs allocated to different slaves.
In order to reduce this sequential component, VLC is further
divided into two functions, namely header-VLC and TC-VLC,
corresponding to the VLC of the MB headers and transformed
coefficients, respectively, of which TC-VLC can be parallelized
while header-VLC is performed sequentially in Master#3. The
data flow is as follows: once the input MBs are received by the
slaves from Master#1, coding begins. When coding is com-
pleted, Master#2 collects the statistics, and in parallel, Master#3
collects the TC-VLC results and performs header-VLC before
sending the bit stream to the standard output. In this algorithm,
the masters do not perform any MB coding.
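The local exchange of decoded MBs can be sketched for the one-row-per-slave example given above. The slave numbering (slave i holds MB row i) and the function names are assumptions made for illustration; other allocations would produce a different neighbor set.

```python
def row_neighbors(slave, num_slaves):
    """Slaves holding the MB rows directly above and below this slave's row."""
    return [s for s in (slave - 1, slave + 1) if 0 <= s < num_slaves]


def exchange_plan(num_slaves):
    """List (sender, receiver) pairs for one round of decoded-MB exchange."""
    return [(s, nb)
            for s in range(num_slaves)
            for nb in row_neighbors(s, num_slaves)]
```

With 15 slaves each holding one of the 15 MB rows of a 352 × 240 frame, every interior slave exchanges data with two neighbors and the two edge slaves with one, so the number of messages per slave stays constant as slaves are added, whereas a single master would have to handle all of them.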
C. Performance Prediction
Let be the number of nodes available for parallelization;
be the internal computation time of node , where
be the size in bytes of an input frame; be
the average size of a coded frame; be the size of a statistics
record; be the asymptotic bandwidth of the communication
channel in second per byte; be the overall startup time of
the communication channel; and be the constants in the
broadcast delay expression [20]. The time for encoding a frame
using the S1 algorithm [31] is given by
(1)
where the first represents the time taken for the master to
send bytes to the slaves; the second represents the
time taken for collecting the VLC results and statistics; the third
represents the computation critical path; the fourth repre-
sents the time taken to collect the decoded data; and the fifth
represents the time taken to broadcast the decoded frame. Also,
represents the time for reading a frame and rearranging the
MB, and represents the time for writing the encoded results
into an output bit stream.
For the S2 algorithm, assume and the statistic collection
time is small compared with and , the frame encoding
time is given by , where
(2)
(3)
where
number of MBs in a column;
delay due to the third master, which consists of the time
taken to collect the TC VLC, and to compile the VLC
header ;
delay due to slave computations
in which the first represents the computation critical path;
the second represents the time taken to send the statistics to
Fig. 6. S1/S2 Predicted frame time.
Fig. 7. S1/S2 Predicted frame rate.
the second master; the third represents the time taken for the
slaves to receive the input MBs; the fourth represents the
time taken to send the VLC result to the third master; and the
fifth represents the time taken for a slave to exchange MBs
with its neighbors. For each neighbor, there is an exchange oper-
ation, which is made up of a send and a receive operation. Since
the number of neighbors equals to , the total number of
messages involved is and the total amount of decoded
data exchanged is .
Figs. 6 and 7 depict the predicted frame time and frame rate
for both algorithms for video. In the calculation, the
value of is is is
is is is is
is is 330 and is 15. is for
S1 because of the MB ordering, and for S2.
V. PARALLELIZATION IN BOTH THE TEMPORAL AND SPATIAL
DOMAINS
A. Parallelization Model and Frame Dependency
To achieve parallelization in the temporal domain, the smallest
temporal unit is a frame. Although coding a frame takes considerable
time (about 1 fps in this case), the average frame time may be
improved if the frames can be coded in a pipeline over a number of
clusters with each frame spatially parallelized on a cluster, where
a cluster is a group of nodes. This model is considered general
because if there is only one node in each cluster, then the paral-
lelization is purely temporal. However, if there is only one cluster
with multiple nodes, then the parallelization is purely spatial.
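The possible cluster configurations for a given node count can be enumerated with a short Python sketch. It relies on the relations n = SC + 1 (ST1) and n = SC + 2 (ST2) given in the caption of Fig. 12; reading S as the number of nodes per cluster and C as the number of clusters is an assumption of this sketch, as is the function name.

```python
def cluster_layouts(n, masters=1):
    """Enumerate (S, C) pairs with S * C == n - masters.

    S is taken as the nodes per cluster and C as the number of clusters
    (an assumed reading of the symbols in Fig. 12).
    """
    workers = n - masters
    return [(s, workers // s)
            for s in range(1, workers + 1)
            if workers % s == 0]
```

For 23 nodes with one global master the candidates are (1, 22), (2, 11), (11, 2), and (22, 1). Under the assumed reading, S = 1 is the purely temporal case and C = 1 the purely spatial case.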
Fig. 8. Temporal data flow across a number of clusters.
Fig. 8 depicts the data flow across a number of clusters. The
frame dependency is that the decoded frame from one cluster is
needed for ME in the neighboring cluster. Apart from this, the
coding of the next frame can start while the coding of the current
frame is still in progress, as long as a minimum number of MBs
needed for ME/MC/IMC are available then.
B. Spatial-Temporal Parallel Algorithm-ST1
Based on the above model, the ST1 algorithm was developed
with all the functions being parallelized except the header-VLC.
Fig. 9 depicts the algorithmic structure in which each frame
is coded by one of the clusters and a global master is responsible
for reading/distributing input frames, collecting encoded
results, header-VLC, and writing the output bit stream. Inside each
cluster, slave 1 is designated for communicating with the global
master. This method of centralizing parallelization control in
slave 1 simplifies the interface between the clusters and the
global master. Assuming n nodes in the implementation, the number
of clusters multiplied by the number of nodes in each cluster, plus
one node for the global master, equals n (n = SC + 1 in
Fig. 12). The computation is modeled as time slots such that in
each time slot, the slaves in a cluster process MBs in total.
The MBs are processed starting from the left edge of the frame
and extends to the right. Assume a frame has columns
and rows of MBs. To initiate the coding of the next frame,
the minimum number of MBs required is MBs from the first
column plus MBs from the second column. The communication
of decoded MBs is such that each slave of cluster sends
its currently decoded MB to the slaves of cluster and
receives the decoded MBs from cluster .
As illustrated in Fig. 9, there are four types of time slots or
cycles associated with this algorithm. A normal slot consists
of coding the MB only. An input master slot includes
reading of the next input frame and transferring of the
frame data to a cluster, during which the other clusters may be
performing MB coding. An output master slot includes
the collection of encoded results to slave 1 within a cluster,
transferring the encoded data to the global master and writing
the output bit stream, while other clusters may be performing
MB coding. An I/O master slot includes reading of a
frame, transferring to and from the clusters, writing the output
bit stream and MB coding.
Fig. 9. ST1 algorithm.
Fig. 10. Time line for ST1.
The working of the algorithm begins with reading of a frame
data by the global master and transferring the data to the clus-
ters. Then, Cluster 1 codes MBs using
cycles and the decoded MBs are sent to Cluster
2 as soon as they are available. The coded results are then col-
lected from the slaves to slave 1 within a cluster, and transferred
to the global master. Therefore, at the start of the process, the
cycles are mainly input master slots, whereas at the end, the cycles are mainly
output master slots. The cycles in between are either normal or I/O master slots. This is
depicted in Fig. 10, in which some clusters begin coding later
than others, and at the end of the sequence, some clusters com-
plete earlier and become idle. While the clusters are filled with
normal or master slots, sustained computation in each cluster is
maintained.
Fig. 11. ST2 algorithm.
C. Spatial-Temporal Parallel Algorithm-ST2
This algorithm was developed for sharing the cluster com-
munication between two global masters. In each cycle, a cluster
processes a column of MBs, where . The MBs are
ordered from top to bottom and then evenly partitioned to the
slaves. A frame requires at least two columns of MBs processed
before the next frame can start. A cluster takes cycles to
code a frame. Denote the th slave of cluster by . Spa-
tially, each processes MBs per frame. At
the end of each time slot, sends all of its decoded MBs
to . Also, sends its first decoded MB to
and its last to . Afterward,
receives the required decoded MBs from
and . Therefore, in each cycle,
for decoded MB communication, there are totally six messages
involving MB of data.
As depicted in Fig. 11, one global master is responsible for
reading the input frame data and distributing them to each slave,
and the other is responsible for collecting the results from the
slaves and writing the output bit stream. There are three dif-
ferent types of cycles in this case. A normal slot performs
MB coding only. A master slot 1 consists of reading
the frame data and collecting results from a cluster, while other
clusters are coding MBs. A master slot 2 consists of
writing the output bit stream, distributing the next frame data
to a cluster, and coding the MBs in other clusters. As there are
two global masters, some of the read/write/transfer/coding cycles
can overlap with each other. For example, when master 1
reads the next frame, results can be collected into master 2. As
opposed to ST1, in which the global master communicates with
a single representative in each cluster, the global masters in ST2
communicate with each slave in each cluster directly.
D. Performance Prediction
For the ST1 algorithm, in coding frames, the first frames
require while the last frames require . In be-
tween, there are involving both frame input and
output. The total execution time in coding frames is
given by
(4)
where is the total number of slots given by
(5)
where , and
and , are given by (6) to (9)
(6)
Fig. 12. (a) ST1—Predicted frame rate (n = SC + 1). (b) ST2—Predicted frame rate (n = SC + 2).
(7)
(8)
(9)
where
size of a MB in bytes;
time for rearranging the input format of an MB;
minimum number of time slots for a frame before the
next frame can start;
sequential frame time.
The term in (6) represents the time for MB
rearrangement after receiving from slave 1, which is equal to
zero when . From (4), the average frame rate is .
For the ST2 algorithm, occurs only once every frame,
i.e., number of is equal to . Similarly, the number of
is also . The number of normal slots is , where
is the total number of time slots. The total execution time
for frames is given by
(10)
where
equal to 2;
number of columns of MBs in a frame;
,
and
given by (11)–(13).
as
(11)
(12)
(13)
In (11), the first represents the delays in the slaves, where
each cluster processes a column of MBs such that each slave is
allocated MBs. The second represents the commu-
nication of decoded MBs between clusters. In (12), the inner
represents the maximum between data collection time
and MB coding time; the lowest term represents the communi-
cation of decoded MBs and is added to the term since
the slaves that communicate with the masters also participate in
decoded MB communication with other clusters. The term
represents the frame reading time in Master 1, which is done in
parallel with the other slaves. Equation (13) is similar to (12),
except that data collection is replaced by the distribution of input
MBs and is replaced by . The average frame rate is given
by . Fig. 12(a) and (b) depict the predicted frame rate
versus , for different choices of , for both algorithms. The
value of is is is 1.0;
is is is 39; is 22.
Fig. 13. (a) S1 measured frame rate. (b) S1 measured speedup.
Fig. 14. (a) S2 measured frame rate. (b) S2 measured speedup.
VI. RESULTS AND DISCUSSIONS
A. Data Collection Conditions
The H.261 program was compiled using mpcc -O3 -qarch=pwr2
[filename.c], using single precision integer format. The
wall-clock time generated by gettimeofday was used to
measure the overall execution duration and individual execu-
tion time per stage, where all the nodes were synchronized
and timed at the start of the execution. Blocking send and
receive were used for all the point-to-point communications
where fixed startup time and constant channel bandwidth were
assumed. The broadcast times were assumed as in Hwang
and Xu [20]. The HPS user space communication protocol
was used to obtain the best performance from the network.
LoadLeveler was used at run time to ensure exclusive
use of the nodes. The program ran under Unix,
which introduces slight variation in the execution time
and hence measurement error. The H.261 parameters were kept at
their default values throughout the test. The coded results were
checked byte-by-byte against the serial program output. The
coded output streams were also decoded and displayed for
visual inspection and comparison.
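The byte-by-byte check of the parallel output against the serial output can be sketched as below. This is an illustrative helper, not the tool used by the authors; the function name `first_mismatch` is hypothetical, and the two bit streams are assumed to have been read into memory as `bytes` objects.

```python
def first_mismatch(serial, parallel):
    """Return the offset of the first differing byte, or None if identical.

    A length difference counts as a mismatch at the end of the shorter
    stream.
    """
    for offset, (a, b) in enumerate(zip(serial, parallel)):
        if a != b:
            return offset
    if len(serial) != len(parallel):
        return min(len(serial), len(parallel))
    return None


if __name__ == "__main__":
    reference = bytes([0x00, 0x01, 0x00, 0x10])
    candidate = bytes([0x00, 0x01, 0x00, 0x11])
    print(first_mismatch(reference, reference))
    print(first_mismatch(reference, candidate))
```

Reporting the offset rather than a plain pass or fail makes it easier to locate which part of the bit stream a faulty parallel run corrupted.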
B. Results of the S1 Algorithm
The measured frame rates (maximum, minimum and median)
in fps for nodes ranging from 1 to 24, together with the predicted
values are plotted in Fig. 13(a) and the corresponding speedup
figures are depicted in Fig. 13(b). From Fig. 13, a number of
observations can be made. First, the median frame rate tracks
the predicted values closely for n less than 8, beyond which it
drops off gradually with a frame rate of 11.2 fps at 23 nodes.
Such behavior can be explained as follows: when n is small, the number
of MBs allocated to each slave is large and therefore computation
dominates. As n increases, this number decreases, whereas
the communication overhead and idle time increase. Hence, the
slope of the curve decreases gradually for , and seems
to level off beyond . Second, there is a substantial dif-
ference between the maximum and minimum frame rates mea-
sured. This is due to the way in which the SAD is calculated
as explained in Section II-B, and the parallelization magnifies
the difference. Third, the predicted frame rate is optimistic. This
is unavoidable, as the overheads related to software control and
communication contention have not been accounted for in the
prediction. Fourth, there are ripples in both measured curves for
Fig. 15. (a) S1 percentage component time. (b) S2 percentage component time.
Fig. 16. (a) ST1 measured frame rate. (b) ST1 measured speedup.
and , which is likely due to measure-
ment error caused by overhead of the operating system and the
slightly uneven distribution of MBs.
C. Results of the S2 Algorithm
The frame rates measured for four nodes up to 24 nodes
and the predicted values are depicted in Fig. 14(a), whilst the
speedup figures are plotted in Fig. 14(b). From Fig. 14(a), it
can be seen that, first, the lower bound for n is 4 (three masters and one
slave). At this n, the frame rate is poor because three
out of the four nodes are either communicating or idling, with
only one node doing useful computation. Second, the frame
rate increases linearly up to , and gradually levels off
beyond this point. This can be explained as the multiple-master
configuration is effective up to this point. Third, the predicted
values are again optimistic, for the same reasons as in S1.
Fourth, for nodes, the median speedup is 14.3, which is
27% better than the S1 case. In fact, S2 has poorer speedup for
, but better speedup otherwise. This can be explained
by examining the percentage component times, as shown in
Fig. 15.
Let be the percentage of computation time, be
the percentage of communication time, and percentage of
idle time. For the S1 algorithm, is high for small but
drops to around 53% at . On the other hand,
and are small for small and increase to 25%–30% and
15%–20%, respectively, for . On the contrary, of
the S2 algorithm is only 23% for , but steadily increases
to over 60% at , and drops off to 60% for ,
while is very small for small n and rises to about 15% for
. For , it shows an opposite trend: large for small n
(75%), and small for large n (25%). From these results, it can
be argued that centralized communication would achieve better
speedup for small n, while distributed communication would
achieve better speedup for large n.
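The percentage component times discussed above can be computed from per-run timing totals as in the following sketch. The paper does not state exactly how the percentages were derived, so this assumes that each percentage is the share of the summed computation, communication, and idle time; the function name `component_percentages` and the sample timings are hypothetical.

```python
def component_percentages(comp_time, comm_time, idle_time):
    """Return (computation %, communication %, idle %) of the total time.

    Assumption: the three components are disjoint and together account
    for the whole measured execution time.
    """
    total = comp_time + comm_time + idle_time
    if total <= 0:
        raise ValueError("total time must be positive")
    return tuple(100.0 * t / total for t in (comp_time, comm_time, idle_time))


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to illustrate the call.
    comp, comm, idle = component_percentages(5.3, 2.7, 2.0)
    print(f"computation {comp:.0f}%, communication {comm:.0f}%, idle {idle:.0f}%")
```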
D. Results of the ST1 Algorithm
Fig. 16 depicts the measured median frame rates and speedup
versus , with the nodes per cluster being a variable. The
curve of represents the purely temporal case, whereas
the other curves represent varying degrees of mixed spatial-tem-
poral parallelization. From Fig. 16, we can observe that first, the
Fig. 17. (a) ST2 measured frame rate. (b) ST2 measured speedup.
curves are reasonably straight and bunched up. Upon close in-
spection, it is found that the curve levels off after ,
and the curve shows similar tendency, but for larger .
This behavior agrees with the prediction in Fig. 12(a). Furthermore,
cases of larger seem to give poorer performance
than those with smaller , which is not clear in the prediction.
Second, out of the five cases, achieves the best speedup
of 12 at , which is slightly better than the S1 case. This is
not entirely unexpected, as both algorithms centralize commu-
nications onto a single master. However, it should also be noted
that the ST1 algorithm is superior for large and , as indicated
in Fig. 12(a). In theory, over 25 fps can be achieved with
for both and , which is impossible for S1.
E. Results of the ST2 Algorithm
Fig. 17 depicts the measured frame rate and speedup for the
ST2 algorithm. It can be observed that the speedup curves for
various are similar for . For , the speedup
of the pure temporal case remains almost constant
at , while the rest continue to increase to close to 13.5 in
the case of for . This figure is better than that of
ST1. Close inspection reveals that the cases with large perform
better than those with small , by a small margin. Theoretically, this
margin will increase for large as depicted in Fig. 12(b). Com-
paring Fig. 17(a) with Fig. 12(b), the case levels off at a
measured frame rate of 7.7 for , some 0.5 fps below the
predicted. In general, the measured frame rate or speedup agrees
with the predicted values but is smaller. This once again highlights the
effect of software overhead and communication contention on the
actual parallel performance. According to the prediction, over
25 fps can be achieved for and , which is much
better than the ST1 case.
F. Summary
Fig. 18 depicts the median speedup for the four algorithms
presented in this paper. It is noted that for , S1 achieves
the best relative speedup, followed by ST1 , ST2
, and S2. This can be explained, as for small , computation
dominates. In this case, S1 has more nodes for computation than
the others, while the others are less efficient because ST1 has the
local masters, ST2 has two global masters and S2 has three masters.
However, for large , communication dominates, making
it inadequate for one master to handle all the communication.
Fig. 18. Comparison of speedup (352 × 240).
Fig. 19. Comparison of efficiency (352 × 240).
Fig. 20. (a) Frame rate comparison—QCIF video. (b) Speedup comparison—QCIF video.
In those cases, S1 and ST1 have similar performance, whereas
ST2 and S2 are superior because of the multiple global mas-
ters. Between these two, the third master of S2 helps to further
parallelize the communications and therefore achieves higher
speedup.
If we define efficiency as (measured speedup / n) × 100%
[20], then Fig. 19 depicts the efficiencies of the four algorithms
versus . It can be seen that the S1 efficiency is between
60%–90% for , which is far better than the other three
algorithms % in this range. Obviously, its speedup
reflects such high efficiency. For , the S1 efficiency
decreases gradually to 44% for . At , their
efficiencies are % % % and
%. On the other hand, their efficiencies peak at
70% at for S2, 58% at for ST1 and 61%
at – for ST2. This implies that disregarding the
actual frame rate achieved in each case, the algorithms are
most efficient using these numbers of nodes. These figures are
indeed comparable with those reported by Sijstermans et al.
[12] (32%), Akramullah et al. [13] (39%) and Agi et al. [19]
(62.5%), as the first two cases used very large . In general,
the S1 and S2 efficiencies seem to decrease as n
increases further. However, the ST1 and ST2 efficiencies seem
to remain between 50%–60% for large n.
For a smaller problem size, Fig. 20 depicts the frame rates and
speedups for the same video reduced to QCIF
. It should also be noted that the curves representing the
ST1 and ST2 algorithms are both of . From Fig. 20(b),
it can be observed that first, for , the relative behavior
of the speedup curves is similar to the case, except
with more prominent features. For large , the differences be-
tween S2, ST1, and ST2 become less obvious, where the S1
curve levels off as in Fig. 18. Second, the S1 algorithm speedup
peaks at 6.7 for , and gradually reduces to 6.1 for ,
which is not seen in Fig. 13(b) because of the large frame size.
From this, we can expect for the video, a peak will
be reached, beyond which there will be no gain in speedup even
if continues to increase. Third, S2, ST1, and ST2 still show
a steadily increasing speedup for n up to 24, where the best
speedups (frame rates) achieved are 13 (37 fps), 12.5 (35.7 fps),
and 12 (34.5 fps), respectively. All three algorithms achieved
real-time coding (30 fps) at . Fourth, S2, ST1, and ST2
have efficiency between 50%–60% for all , while S1 has good
efficiency for small , but poor efficiency for large .
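The relation between speedup, frame rate, and efficiency can be checked against the QCIF figures quoted above. The following is a minimal sketch, not the authors' code. It assumes that efficiency is the measured speedup divided by the number of nodes, expressed as a percentage, and that the best speedups quoted for S2, ST1, and ST2 were measured at 24 nodes; the function names are hypothetical.

```python
def efficiency_percent(speedup, num_nodes):
    """Efficiency as speedup per node, in percent (assumed definition)."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return 100.0 * speedup / num_nodes


def implied_sequential_fps(parallel_fps, speedup):
    """Sequential frame rate implied by speedup = parallel_fps / sequential_fps."""
    return parallel_fps / speedup


# Best QCIF speedups and frame rates quoted in the text.
QCIF_BEST = {"S2": (13.0, 37.0), "ST1": (12.5, 35.7), "ST2": (12.0, 34.5)}
NODES = 24  # assumption: the quoted values were measured at 24 nodes

if __name__ == "__main__":
    for name, (speedup, fps) in QCIF_BEST.items():
        eff = efficiency_percent(speedup, NODES)
        seq = implied_sequential_fps(fps, speedup)
        print(f"{name}: efficiency {eff:.1f}%, implied sequential rate {seq:.2f} fps")
```

Under these assumptions the three efficiencies fall between 50% and 55%, which is consistent with the 50%–60% range stated for S2, ST1, and ST2, and the three implied sequential frame rates agree with each other to within a few hundredths of a frame per second.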
VII. CONCLUSION
Broadly, the use of domain decomposition for parallelizing
the H.261 video-coding algorithm is a viable approach as
long as issues concerning data partitioning, dependency and
communication are systematically dealt with. This means that
the granularity should be defined, the dependency in both the
spatial and temporal domains should be considered, and their
communication requirement should be analyzed. From these,
a performance model may be established, which can be a
useful metric for gauging the actual performance of the parallel
implementation. From the investigation described in this paper,
a number of specific issues are identified. First, although
a reasonable model can be derived, detailed knowledge in
parallelization overhead, OS overhead and communication
contention will certainly improve its accuracy. Second, for
video coding, a mixed spatial-temporal parallelization repre-
sents a general approach. The purely temporal or spatial cases
can simply be treated as special cases without losing generality.
Third, centralizing communications through a single master
among multiple slaves is a common approach, but it is effective
only when the number of nodes is small. It becomes quite
ineffective as compared with distributing communications
across multiple masters, when the node count is high or
communication is expensive, as in the case of workstation
clusters. In our implementation, the S1 algorithm achieves the
best speedup for , whereas the S2 algorithm is superior
for . The same applies to the ST1 and ST2 algorithms,
achieving slightly lower speedup than S2. Fourth, both the
spatial-temporal algorithms scale well compared with the pure
temporal or spatial cases, and the algorithms using multiple
masters tend to scale a little better with more linear speedup.
Concerning implementation efficiency, the S1 algorithm has
efficiency between 60%–90% for . For ,
the S2 algorithm is more efficient. The efficiency of both the
ST1 and ST2 algorithms consistently ranges between 50%–60%,
which is fair. According to their models, these two algorithms
have the potential of remaining at this efficiency level for large
, which is unlikely for S1 and S2.
ACKNOWLEDGMENT
The authors would like to express their sincere gratitude to
the Computer Center at the University of Hong Kong for their
technical support and use of their IBM SP2 system.
REFERENCES
[1] ITU-T recommendation H.261: Video codec for audiovisual services at
p  64 kbits, 1990.
[2] ITU-T recommendation H.263: Video coding for low bitrate communi-
cation, 1995.
[3] MPEG-1: Coding of moving pictures and associated audio for digital
storage media at up to about 1.5 Mbit/s, 1993.
[4] MPEG-2: Generic coding of moving pictures and associated audio,
1995.
[5] Overview of the MPEG-4 standard, 1997.
[6] V. Bhaskaran and K. Konstantinides, Image and Video Compression
Standards: Algorithms and Architectures. Norwell, MA: Kluwer,
1995.
[7] B. Furht, S. W. Smoliar, and H. J. Zhang, Video and Image Processing
in Multimedia Systems. Norwell, MA: Kluwer, 1995.
[8] J. L. Mitchell, W. B. Pennebaker, C. E. Fogg, and D. J. LeGall, MPEG
Video Compression Standard. London, U.K.: Chapman and Hall,
1997.
[9] L. Torres and M. Kunt, Video Coding: The Second Generation Ap-
proach. Norwell, MA: Kluwer, 1996.
[10] C. Huang and J. L. Wu, “New Generation of real-time software-based
video codec: Popular video coder II (PVC-II),” IEEE Trans. Consumer
Electron., vol. 42, pp. 963–973, Nov. 1996.
[11] K. Li and H. Yuen, “A high performance image compression technique
for multimedia applications,” IEEE Trans. Consumer Electron., vol. 42,
pp. 239–243, May 1996.
[12] F. Sijstermans and J. Van der Meer, “CD-I full-motion video encoding
on a parallel computer,” Commun. ACM, vol. 34, no. 4, pp. 81–91, 1991.
[13] S. M. Akramullah, I. Ahmad, and M. L. Liou, “A portable and scalable
MPEG-2 video encoder on parallel and distributed computing systems,”
in SPIE Proc. Visual Communications and Image Processing, 1996, pp.
973–984.
[14] S. M. Akramullah, I. Ahmad, and M. L. Liou, “A data-parallel approach for real-time MPEG-2 video encoding,”
J. Parall. Distrib. Comput., vol. 20, no. 2, pp. 129–146, Nov. 1, 1995.
[15] K. Shen, L. A. Rowe, and E. J. Delp, “A parallel implementation of an
MPEG1 encoder: Faster than real-time!,” in Proc. SPIE Conf. Digital
Video Compression: Algorithms and Technologies, San Jose, CA, Feb.
5–10, 1995, pp. 407–418.
[16] K. Shen and E. J. Delp, “A spatial-temporal parallel approach for
real-time MPEG video compression,” in Proc. 25th Int. Conf. Parallel
Processing, Bloomingdale, IL, Aug. 13–15, 1996, pp. II100–II107.
[17] T. Akiyama et al., “MPEG-2 video codec using image compression
DSP,” IEEE Trans. Consumer Electron., vol. 40, no. 3, pp. 466–472,
1994.
[18] C. Bouville et al., “DVFLEX: A flexible MPEG real time video codec,”
in Proc. IEEE Int. Conf. Image Processing, vol. II, 1996, pp. 829–832.
ICIP’96.
[19] I. Agi and R. Jagannathan, “A portable fault-tolerant parallel software
MPEG-1 encoder,” in Multimedia Tools and Applications, 1996, vol. 2,
pp. 183–197.
[20] K. Hwang and Z. Xu, “Scalable parallel computers for real-time signal
processing,” IEEE Signal Processing Mag., pp. 50–66, July 1996.
[21] P. Chalermwat et al., “Parallel image processing in heterogeneous com-
puting network systems,” in Proc. IEEE Int. Conf. Image Processing,
vol. II, 1996, pp. 161–164. ICIP’96.
[22] Y. Sorel, “Real-time embedded image processing applications using the
A³ methodology,” in Proc. IEEE Int. Conf. Image Processing, vol. II,
1996, pp. 145–148. ICIP’96.
[23] S. M. Bhandarkar and H. R. Arabnia, “Parallel computer vision on a
reconfigurable multiprocessor network,” IEEE Trans. Parallel Distrib.
Syst., vol. 8, pp. 292–309, Mar. 1997.
[24] T. Agerwala et al., “SP2 system architecture,” IBM Syst. J., vol. 34, no.
2, pp. 152–184, 1995.
[25] C. B. Stunkel et al., “The SP2 high-performance switch,” IBM Syst. J.,
vol. 34, no. 2, pp. 185–204, 1995.
[26] M. Snir, P. Hochschild, D. D. Frye, and K. J. Gildea, “The communica-
tion software and parallel environment of the IBM SP2,” IBM Syst. J.,
vol. 34, no. 2, pp. 205–221, 1995.
[27] (1994) SP Parallel Programming Workshop—LoadLeveler. Maui
High Performance Computing Center, Maui, HA. [Online]. Available:
http://www.mhpcc.edu/training/workshop/index.html
[28] A. C. Hung, “PVRG-P64 Codec 1.1,” Portable Video Research Group
(PVRG), Stanford Univ., Stanford, CA, 1993.
[29] K. Hwang, Z. Xu, and M. Arakawa, “Benchmark evaluation of the IBM
SP2 for parallel signal processing,” IEEE Trans. Parallel Distrib. Pro-
cessing, vol. 7, pp. 522–536, May 1996.
[30] I. Foster, Designing and Building Parallel Programs: Concepts
and Tools for Parallel Software Engineering. Reading, MA:
Addison-Wesley, 1995.
[31] K. K. Leung, “Spatial and Temporal Data Parallelization of the H.261
Video Codec on the IBM SP2 Multiprocessor System,” M.Sc. thesis,
Dept. Elec. Electron. Eng., Univ. Hong Kong, Hong Kong, 1997.
[32] N. H. C. Yung and K. C. Chu, “Fast and parallel video encoding by
workload balancing,” in Proc. IEEE SMC’98, Oct. 1998, pp. 4642–4647.
Nelson H. C. Yung (SM’96) received the B.Sc.
and Ph.D. degrees from the University of New-
castle-Upon-Tyne, Newcastle-Upon-Tyne, U.K., in
1982 and 1985, respectively.
He was a Lecturer at the University of New-
castle-Upon-Tyne from 1985 until 1990, where
he was involved in the research and development
(R&D) of digital image processing and parallel
processing. From 1990 to 1993, he was a Senior
Research Scientist at the Department of Defence,
Australia, where he headed a team on the R&D of
military-grade signal analysis systems. He joined the University of Hong Kong
in late 1993 as an Associate Professor, where he leads a research group in
Digital Image Processing and Intelligent Transportation Systems. He is the
founding Director of the Laboratory for Intelligent Transportation Systems
Research, and has published over 100 research papers. His biography is
published in Marquis’ Who’s Who in the World.
Dr. Yung serves as Reviewer for the IEEE TRANSACTIONS ON SYSTEMS, MAN,
AND CYBERNETICS—PART B, IEEE TRANSACTIONS ON SIGNAL PROCESSING,
IEE SPIE Optical Engineering—Part G, HKIE Proceedings, and the Micropro-
cessors and Microsystems Journal. He is a Chartered Electrical Engineer and
Member of the HKIE and IEE.
Kwong-Keung Leung (S’99) received the B.Sc. and
M.Sc. (with distinction) degrees from the Department
of Electrical and Electronic Engineering, University
of Hong Kong, in 1990 and 1997, respectively, where
he is currently working toward the Ph.D. degree.
His research interests include multiprocessor
scheduling, dynamic load balancing, heterogeneous
computing, and parallelization of multimedia
systems.
