A Reconfigurable Functional Unit for TriMedia/CPU64 - A Case Study by Mihai Sima et al.
Mihai Sima (1, 2), Sorin Cotofana (1), Stamatis Vassiliadis (1),
Jos T.J. van Eijndhoven (2), and Kees Vissers (3)
(1) Delft University of Technology, Department of Electrical Engineering,
Mekelweg 4, 2628 CD Delft, The Netherlands,
{M.Sima,S.D.Cotofana,S.Vassiliadis}@et.tudelft.nl
(2) Philips Research Laboratories, Department of Information and Software Technology,
Professor Holstlaan 4, 5656 AA Eindhoven, The Netherlands,
jos.van.eijndhoven@philips.com
(3) TriMedia Technologies, Inc., 1840 McCarthy Boulevard, Milpitas, California 95035, U.S.A.,
kees.vissers@trimedia.com
Abstract. The paper presents a case study on augmenting a TriMedia/CPU64
processor with a Reconfigurable (FPGA-based) Functional Unit (RFU). We first
propose an extension of the TriMedia/CPU64 architecture, which consists of an
RFU and its associated instructions. Then, we address the computation of the
8 × 8 IDCT on such an extended TriMedia, and propose a scheme to implement an
8-point IDCT operation on the RFU. Further, we address the decoding of Variable
Length Codes and describe the FPGA implementation of a Variable Length
Decoder (VLD) computing facility. When mapped on an ACEX EP1K100 FPGA
from Altera, our 8-point IDCT exhibits a latency of 16 and a recovery of 2 TriMedia
cycles, and occupies 42% of the FPGA's logic array blocks. The proposed
VLD exhibits a latency of 7 TriMedia cycles when mapped on the same FPGA,
and utilizes 6 of its embedded array blocks. By using the 8-point IDCT computing
facility, an 8 × 8 IDCT including all overheads can be computed with a throughput
of 1/32 IDCT/cycle. Also, with the proposed VLD computing facility, a single
DCT coefficient can be decoded in 11 cycles including all overheads. Simulation
results indicate that by configuring each of the 8-point IDCT and VLD computing
facilities on a different FPGA context, and by activating the contexts as needed,
the augmented TriMedia can perform MPEG macroblock parsing followed by
pel reconstruction with an improvement of 20-25% over the standard TriMedia.
1 Introduction
A common issue addressed by computer architects is the range of performance
improvements that may be achieved by augmenting a general purpose processor with a
reconfigurable core. The basic idea of such an approach is to exploit both the general
purpose processor's capability to achieve medium performance for a large class of
applications, and the FPGA's flexibility to implement application-specific computations.
Thus far, FPGA-augmented processors have predominantly assumed a simple general purpose
core [1–4]. Considering the class of VLIW machines, two general research questions
may be raised:
– What are the influences of reconfigurable arrays on the performance of commercially
available VLIW processors?
– What are the architectural changes needed for incorporating the reconfigurable array
into the processor core?
In an attempt to answer these questions, we present a case study on augmenting
a TriMedia/CPU64 processor with a Reconfigurable (FPGA-based) Functional Unit
(RFU). With such an RFU, the user is given the freedom to define and use any computing
facility subject to the FPGA size and TriMedia/CPU64 organization. In order to evaluate
the potential performance of the augmented TriMedia/CPU64, we chose a significant
chunk of MPEG decoding as a benchmark. In particular, since the video data accounts
for more than 80% of the whole MPEG bit stream [5], we considered the parsing of
Variable-Length (VL) coded data at the macroblock layer followed by a pel reconstruction
procedure as the benchmark. That is, all the data elements corresponding to slice and
higher layers are considered constant for our experiment.
We decided to provide hardware support for two functions of the selected benchmark:
the 8-point (1-D) Inverse Discrete Cosine Transform (IDCT) and the Variable-Length
Decoder (VLD). By developing VHDL code and mapping it with Altera tools, we
evaluated the performance of these FPGA-based functions. Further, an MPEG-compliant
program has been written in C, and then compiled, scheduled, and finally
simulated with the TriMedia tool-chain. For a typical MPEG string with 10% intra-coded,
70% B-coded, and 20% P-coded macroblocks, we found that the augmented
TriMedia/CPU64 can perform macroblock parsing followed by pel reconstruction with
an improvement of 20-25% over the standard TriMedia. Given that TriMedia/CPU64
is a 5 issue-slot VLIW processor with 64-bit datapaths and a very rich
multimedia instruction set, such an improvement within the target media processing domain
indicates that the hybrid TriMedia/CPU64 + FPGA is a feasible approach.
The paper is organized as follows. For background purposes, we briefly present
several issues concerning MPEG and the FPGA architecture in Section 2. Section 3
describes the architectural extension of TriMedia/CPU64. Implementation issues related
to the 1-D IDCT and VLD computing facilities and their corresponding instructions are
discussed in Sections 4 and 5. The 8 × 8 IDCT and entropy decoder implementations are
then described in Sections 6 and 7. The execution scenario of the chosen benchmark on
both the standard and the extended TriMedia, and experimental results are presented in
Section 8. Section 9 completes the paper with some conclusions and closing remarks.
2 Background
Data compression is the reduction of redundancy in data representation, carried out for
decreasing data storage requirements and data communication costs. A typical video
codec system is presented in Figure 1 [6,5]. The lossy source coder performs filtering,
transformation (such as DCT, subband decomposition, or differential pulse-code
modulation), quantization, etc. The output of the source coder still exhibits various kinds of
statistical dependencies. The (lossless) entropy coder exploits the statistical properties
of data and removes the remaining redundancy after the lossy coding.
[Figure 1: block diagram. Encoder: video in, Lossy Source Coder, Lossless Entropy Coder.
Digital Channel. Decoder: Lossless Entropy Decoder, Lossy Source Decoder, video out.]
Fig. 1. The block diagram of a generic video codec – adapted from [6,5].
In MPEG, the DCT-Quantization pair is used as a lossy coding technique. The DCT
algorithm processes the video data in blocks of 8 × 8, decomposing each block into
a weighted sum of 64 spatial frequencies. At the output of the DCT, the data is also
organized in 8 × 8 blocks of coefficients, each coefficient representing the contribution
of a spatial frequency for the video block being analyzed. Since the human eye cannot
readily perceive high spatial frequency activity, a quantization step is carried out.
The goal is to force as many DCT coefficients as possible to zero, especially those
corresponding to high spatial frequencies, within the boundaries of the prescribed video
quality. Then, a zig-zag operation transforms the matrix into a vector in which the
coefficients are ordered from the lowest frequencies (upper-left hand corner of the 8 × 8
block) to the higher ones (lower-right hand corner of the matrix). Usually, this vector
exhibits large numbers of consecutive zeros. The subsequent compression step is carried
out by the entropy coder, which consists of two major parts: the Run-Length Coder
(RLC) and the Variable-Length Coder (VLC). The RLC represents consecutive zeros by
their run lengths. Since not each and every zero is coded, the number of samples is
reduced. The RLC output data are composite words, also referred to as source symbols,
which describe pairs of zero-run lengths and quantized DCT coefficient values. When
all the remaining coefficients in a vector are zero, they are all coded by the special
symbol end-of-block. Variable length coding, also known as Huffman coding, is a mapping
process between source symbols and variable length codewords. The variable length
coder assigns shorter codewords to frequently occurring source symbols, and vice versa,
so that the average bit rate is reduced. In order to achieve maximum compression, the
coded data is sent through a continuous stream of bits with no specific guard bit
assigned to separate two consecutive symbols. As a result, the decoding procedure
must recognize the code length as well as the symbol itself.
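The zig-zag scan and the run-length coding step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names, the sample block, and the `EOB` marker are hypothetical, and details of the MPEG standard such as the separate treatment of the DC coefficient are ignored.

```python
EOB = "EOB"  # stand-in for the special end-of-block symbol


def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diagonal) if s % 2 == 0 else diagonal)
    return order


def run_length_encode(vector):
    """Turn a coefficient vector into (zero-run, level) pairs followed by EOB."""
    symbols, run = [], 0
    last = max((i for i, c in enumerate(vector) if c != 0), default=-1)
    for c in vector[: last + 1]:
        if c == 0:
            run += 1
        else:
            symbols.append((run, c))
            run = 0
    symbols.append(EOB)  # every remaining coefficient is zero
    return symbols


# A quantized block with three non-zero low-frequency coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, -3, 5
vector = [block[r][c] for r, c in zigzag_order()]
symbols = run_length_encode(vector)
print(symbols)  # [(0, 12), (0, -3), (1, 5), 'EOB']
```

The 64 coefficients collapse into three composite symbols plus the end-of-block marker, which is the reduction in sample count that the run-length coder is meant to achieve.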
Subsequently, we will focus on MPEG decoding, i.e., on the inverse operation
of MPEG coding. Further, we will briefly present the theoretical background of the Inverse
Discrete Cosine Transform (IDCT), entropy decoding, as well as some issues related to
the MPEG standard.
2.1 Inverse Discrete Cosine Transform
The transformation for an N-point 1-D IDCT is defined by [7]:

$$x_i = \frac{2}{N} \sum_{u=0}^{N-1} K_u X_u \cos\frac{(2i+1)u\pi}{2N}$$

where $X_u$ are the inputs, $x_i$ are the outputs, and $K_u = \sqrt{1/2}$ for $u = 0$, otherwise
$K_u$ is 1. For MPEG, a 2-D IDCT processes an 8 × 8 matrix $X$ [5]:

$$x_{i,j} = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} K_u K_v X_{u,v} \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16}$$
One strategy to compute the 2-D IDCT is the standard row-column separation. The
2-D transform is performed by applying the 1-D transform to each row (horizontal
IDCTs) and subsequently to each column (vertical IDCTs) of the data matrix. This
strategy can be combined with different 1-D IDCT algorithms to further reduce the
computational complexity. One of the most efficient 1-D IDCT algorithms has been
proposed by Loeffler [8]. A slightly different version of the Loeffler algorithm, in which
the $\sqrt{2}$ factors are moved around, has been proposed by van Eijndhoven and Sijstermans
[9]. In our experiment, we will use this modified algorithm (see Figure 2).
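As an illustration of the row-column separation, the following Python sketch evaluates the 2-D IDCT both directly from the double sum and as two passes of a straightforward 1-D IDCT. It is not the Loeffler algorithm; it is a plain evaluation of the cosine sums. The function names are hypothetical, and the per-pass scale of 1/2 is an assumption of this sketch, chosen so that two passes reproduce the 1/4 factor of the 2-D formula.

```python
import math

N = 8


def k(u):
    """Normalization constant K_u: sqrt(1/2) for u = 0, otherwise 1."""
    return math.sqrt(0.5) if u == 0 else 1.0


def idct_1d(X, scale):
    """N-point 1-D IDCT of the sequence X, multiplied by the given scale."""
    n = len(X)
    return [
        scale * sum(k(u) * X[u] * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                    for u in range(n))
        for i in range(n)
    ]


def idct_2d_direct(X):
    """8 x 8 IDCT evaluated directly from the double sum."""
    return [
        [
            0.25 * sum(
                k(u) * k(v) * X[u][v]
                * math.cos((2 * i + 1) * u * math.pi / 16)
                * math.cos((2 * j + 1) * v * math.pi / 16)
                for u in range(N) for v in range(N)
            )
            for j in range(N)
        ]
        for i in range(N)
    ]


def transpose(M):
    return [list(col) for col in zip(*M)]


def idct_2d_row_column(X, scale=0.5):
    """Row pass, transpose, column pass, transpose back."""
    rows = [idct_1d(row, scale) for row in X]                # sum over v for each u
    cols = [idct_1d(col, scale) for col in transpose(rows)]  # sum over u for each j
    return transpose(cols)


# Compare both evaluations on an arbitrary coefficient block.
X = [[((3 * u + 5 * v) % 7) - 3 for v in range(N)] for u in range(N)]
direct = idct_2d_direct(X)
separable = idct_2d_row_column(X)
max_diff = max(abs(direct[i][j] - separable[i][j])
               for i in range(N) for j in range(N))
print(max_diff < 1e-9)
```

The direct form needs 64 multiply-accumulate steps per output sample, whereas the separable form needs 16 eight-point transforms per block, which is why the 1-D transform is the natural unit to accelerate.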
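[Figure 2: signal-flow graph of the 8-point transform from the eight inputs to the outputs 0 to 7,
with rotators labeled $\sqrt{2}C_1$, $\sqrt{2}C_3$, and $\sqrt{2}C_6$, and scalings by 1/2.]
Fig. 2. The modified 'Loeffler' algorithm – from [9].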
In the figure, the round block signifies a multiplication by $C'_0 = \sqrt{1/2}$. The
butterfly block and the associated equations are presented in Figure 3.
[Figure 3: butterfly block with inputs $I_0$, $I_1$ and outputs $O_0$, $O_1$.]

$$O_0 = I_0 + I_1, \qquad O_1 = I_0 - I_1$$

Fig. 3. The butterfly – from [8].
A square block depicts a rotation which transforms a pair $[I_0, I_1]$ into $[O_0, O_1]$. The
symbol of a rotator and the associated equations are presented in Figure 4. Although an
implementation of such a rotator with three multiplications and three additions is
possible [8], we use the direct implementation of the rotator with four multiplications and
two additions, since it shortens the critical path and improves numerical accuracy.
Therefore, multiplications by the constants $C'_0$, $C'_1$, $S'_1$, $C'_3$, $S'_3$, $C'_6$, and $S'_6$
have to be carried out. For more details regarding this problem, we refer the reader to
the bibliography [10].
[Figure 4: rotator block labeled $kC_n$ with inputs $I_0$, $I_1$ and outputs $O_0$, $O_1$.]

$$O_0 = I_0\,k\cos\frac{n\pi}{16} - I_1\,k\sin\frac{n\pi}{16} = C'_n I_0 - S'_n I_1$$

$$O_1 = I_0\,k\sin\frac{n\pi}{16} + I_1\,k\cos\frac{n\pi}{16} = S'_n I_0 + C'_n I_1$$

Fig. 4. The rotator – [8].
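The two rotator implementations mentioned above can be compared with a small floating-point sketch. The function names are hypothetical, and the three-multiplication variant shown here is one common factorization with precomputed constant sums; it is an assumption of this sketch and not necessarily the exact form used in [8].

```python
import math


def butterfly(i0, i1):
    """Butterfly: sum and difference of the two inputs."""
    return i0 + i1, i0 - i1


def rotator_4mul(i0, i1, k, n):
    """Direct rotator: four multiplications and two additions."""
    c = k * math.cos(n * math.pi / 16)  # C'_n
    s = k * math.sin(n * math.pi / 16)  # S'_n
    return c * i0 - s * i1, s * i0 + c * i1


def rotator_3mul(i0, i1, k, n):
    """Factored rotator: three multiplications and three additions.

    The sums (c + s) and (s - c) are constants known at design time,
    so they are not counted as run-time additions.
    """
    c = k * math.cos(n * math.pi / 16)
    s = k * math.sin(n * math.pi / 16)
    t = c * (i0 + i1)
    return t - (c + s) * i1, t + (s - c) * i0


for n in (1, 3, 6):
    direct = rotator_4mul(100.0, -37.0, math.sqrt(2), n)
    factored = rotator_3mul(100.0, -37.0, math.sqrt(2), n)
    print(n, all(abs(a - b) < 1e-9 for a, b in zip(direct, factored)))
```

In the factored form the shared product `t` sits in front of a further addition on both outputs, which lengthens the dependency chain. This is consistent with the critical-path argument given above for preferring the direct form.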
2.2 Entropy Decoder
In MPEG, the entropy decoder consists of a Variable-Length Decoder (VLD) followed
by a Run-Length Decoder (RLD). The input to the VLD is the incoming encoded bit
stream, and the output is the decoded symbols. Since the code length of the symbol is
variable, the input and output bit rates of a VLD cannot both be kept constant. Three
different decoder types are possible [6]: constant input rate, constant output rate, and
variable input-output rate.
The constant-input-rate VLD decodes a fixed number of bits and produces a variable
number of symbols per unit time. An example of such a decoder, which decodes one
bit per cycle, is described in [11]. The decoder employs a binary tree search technique
in which a token is propagated in a reverse Huffman tree constructed from the original
codes. Although some improvements of the tree-based method make it possible to
decode more than one bit per cycle [12], the tree-based approaches are not suitable for
high performance applications such as high-definition television, because a high clock
rate is needed.
A constant-output-rate VLD decodes one codeword (symbol) per cycle regardless
of its length [13]. Generally speaking, a constant-output-rate VLD contains a look-up
table which receives the variable-length code itself as the address. The decoded symbol
(run-level pair or end-of-block) and the codeword length are generated in response to
that address. Since the longest codeword excluding Escape has 17 bits, the LUT size
could reach 131072 ($= 2^{17}$) words for a direct mapping of all possible codewords.
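The direct-mapped look-up principle can be illustrated with a toy prefix-free code whose longest codeword has 3 bits, so that the table has $2^3$ entries instead of $2^{17}$. The code table, symbol names, and function names below are hypothetical and unrelated to the MPEG VLC tables.

```python
TOY_CODE = {"1": "A", "01": "B", "001": "C", "000": "D"}  # hypothetical code
MAX_LEN = 3  # length of the longest codeword


def build_lut(code, max_len):
    """Direct-mapped table: every max_len-bit window maps to (symbol, length)."""
    lut = [None] * (1 << max_len)
    for bits, symbol in code.items():
        pad = max_len - len(bits)
        base = int(bits, 2) << pad
        for low in range(1 << pad):  # all windows that start with this codeword
            lut[base + low] = (symbol, len(bits))
    return lut


def decode(bitstring, lut, max_len):
    """One table look-up per symbol, then shift by the returned code length."""
    symbols, pos = [], 0
    while pos < len(bitstring):
        window = bitstring[pos:pos + max_len].ljust(max_len, "0")
        symbol, length = lut[int(window, 2)]
        symbols.append(symbol)
        pos += length
    return symbols


lut = build_lut(TOY_CODE, MAX_LEN)
print(len(lut), decode("1010010001", lut, MAX_LEN))
# 8 ['A', 'B', 'C', 'D', 'A']
```

Short codewords occupy many table entries (the 1-bit codeword fills half of the table), which is the redundancy that grows to 131072 words when the window is 17 bits wide.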
A variable-input-output-rate VLD is a mixture of the first two VLDs. It is implemented
as a repeated table look-up, each step decoding a variable size chunk of bits. If
a valid code is encountered, a run/level pair or an end-of-block is generated. If a miss
is detected, a chunk size for the next look-up is generated. In this way, the short (most
probable) codewords are preferentially decoded. A variable-input-output-rate VLD exhibits
an acceptable decoding throughput, while the size of the look-up table is reasonably small.
The run-length decoder passes the VLC-decoded codewords through if they are not
run-length codes, otherwise it outputs the specified number of zeros.
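A minimal sketch of the run-length decoding step follows, assuming that the VLD delivers a list of (run, level) pairs terminated by an end-of-block marker. The function name and the symbol representation are hypothetical.

```python
EOB = "EOB"  # stand-in for the end-of-block symbol


def run_length_decode(symbols, size=64):
    """Expand (run, level) pairs into a vector of `size` coefficients."""
    coeffs = []
    for symbol in symbols:
        if symbol == EOB:
            break  # all remaining coefficients are zero
        run, level = symbol
        coeffs.extend([0] * run)  # the zeros that precede the coefficient
        coeffs.append(level)
    if len(coeffs) > size:
        raise ValueError("more coefficients than the block can hold")
    coeffs.extend([0] * (size - len(coeffs)))
    return coeffs


vector = run_length_decode([(0, 12), (0, -3), (1, 5), EOB])
print(vector[:6])  # [12, -3, 0, 5, 0, 0]
```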
2.3 Macroblock parsing and pel reconstruction
The macroblock parsing process reads the VL coded data string from which all the
headers corresponding to slice and higher layers have been removed, and outputs various
symbols: decoding parameters at the macroblock layer (macroblock_address_increment,
macroblock_type, coded_block_pattern, and quantizer_scale), motion values, and
composite symbols (run/level pairs and end_of_block). The decoding of the Variable-Length
Codes (VLC) is performed according to a set of VLC tables defined by the MPEG
standard. The motion values are used by a motion compensation process which is not
considered here. However, since these values are decoded during the macroblock parsing,
the overhead associated with the decoding of the motion values will be taken into
consideration in the subsequent experiment.
Following the macroblock parsing, a pel reconstruction process recreates 8 × 8
matrices of pels. The pel reconstruction module is depicted in Figure 5. Its functionality is
as follows. First, 8 × 8 matrices of quantized DCT coefficients are recreated by a Matrix
Reconstruction module. Second, an inverse quantization (InvQ) is performed. An
8 × 8 quantization table and a multiplicative quantization factor (quantizer_scale) are
used in the InvQ process. Third, a DC prediction unit reconstructs the DC coefficient in
intra-coded macroblocks. Finally, an IDCT is performed. In connection with Figure 5
and the subsequent experiment, we would like to mention that the VLC decoder and
the IDCT will benefit from reconfigurable hardware support.
[Figure 5: block diagram with two parts. Macroblock parsing: VLC decoder and decoder controller,
fed by the VLC data at macroblock layer and by parameters from higher layers (picture type).
Pel reconstruction: 8x8 matrix reconstruction, Inv Q, DC prediction, and IDCT, producing the
reconstructed pels. Other recoverable labels: quantized DCT coefficients, 8x8 matrices of
coefficients, AC and DC coefficients, quantization table, intra select, quantizer_scale,
macroblock_address_increment, forward_f, forward_r_size, backward_f, backward_r_size.]
Fig. 5. Macroblock parsing and pel reconstruction module – adapted from [5].
We conclude this section with a review on the architecture of the FPGA we used as
an experimental reconﬁgurable core.
2.4 The FPGA architecture
Field-Programmable Gate Arrays (FPGA) [14] are devices which can be conﬁgured in
the ﬁeld by the end user. In a general view, an FPGA is composed of two constituents:
Raw Hardware and Conﬁguration Memory. The function performed by the raw hard-
ware is deﬁned by the information stored into the conﬁguration memory. Generally
speaking, a multiple-context FPGA [15] is an FPGA having the conﬁguration mem-
ory replicated in order to contain several conﬁgurations for the raw hardware. That is,
a multiple-context FPGA contains an on-chip cache of raw hardware conﬁgurations,
which are referred to as contexts. Such a cache allows a context switch to occur on the
order of nanoseconds [16]. However, loading a new conﬁguration from off-chip is still
limited by low off-chip bandwidth.
In the sequel, we will assume that the architecture of the raw hardware is identical
with that of an ACEX 1K device from Altera [17]. Our choice could allow future
single-chip integration, since both ACEX 1K FPGAs and TriMedia are manufactured in the
same TSMC technological process. Briefly, an ACEX 1K device contains an array of
Logic Cells, each including a 4-input Look-Up Table (LUT), a relatively small number of
Embedded Array Blocks (EABs), each EAB being actually a RAM block with 8 inputs and 16
outputs, and an interconnection network. In order to give a general view, we mention
that the logic capacity of the ACEX 1K family ranges from 576 logic cells and 3 EABs
for the EP1K10 device to 4992 logic cells and 12 EABs for the EP1K100 device. The
maximum operating frequency for synchronous designs mapped on an ACEX 1K FPGA is
180 MHz. More details regarding the architecture and operating modes of ACEX 1K
devices, as well as data sheet parameters, can be found in [17].
3 An architectural extension for TriMedia/CPU64
TriMedia/CPU64 is a 64-bit 5 issue-slot VLIW core [18], launching a long instruction
every clock cycle. It has a uniform 64-bit wordsize through all functional units,
the register file, load/store units, on-chip highway and external memory. Each of the
five operations in a single instruction can (in principle) read two register arguments
and write one register result. The architecture supports subword parallelism and is
optimized with respect to media processing. With the exception of floating point divide
and square root, all functional units have a recovery¹ of 1, while their latency² ranges
from 1 to 4. The TriMedia/CPU64 VLIW core also supports multi-slot operations, or
super-operations. Such a super-operation occupies two neighboring slots in the VLIW
instruction, and maps to a double-width functional unit. This way, operations with more
than two arguments and/or more than one result are possible.
First, we propose that the TriMedia/CPU64 processor be augmented with a Reconfigurable
Functional Unit (RFU) which consists mainly of a multiple-context FPGA
core. A hardwired Configuration Unit, which manages the reconfiguration of the raw
hardware, is associated with the reconfigurable functional unit, as depicted in Figure
6. The reconfigurable functional unit is embedded into TriMedia like any other hardwired
functional unit, i.e., it receives instructions from the instruction decoder, reads its
input arguments from the register file, and writes the computed values back to it. In this
way, only minimal modifications of the basic architecture are required.
In order to use the RFU, a kernel of new instructions is needed. This kernel constitutes
the extension of the TriMedia/CPU64 instruction set architecture we propose. It
includes the following instructions: SET CONTEXT, ACTIVATE CONTEXT, and EXECUTE.
Loading context information into the RFU configuration memory is performed under
the command of a SET CONTEXT instruction. The ACTIVATE CONTEXT instruction controls
the swapping of the active configuration with one of the idle on-chip configurations.
The operations performed by the computing resources configured on the raw hardware
are launched by EXECUTE instructions. In this way, the execution of an RFU-mapped
operation requires three basic stages: set, activate, and execute [19].
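The set, activate, and execute stages can be mimicked by a small behavioural model. The class below is a hypothetical software sketch and not a description of the hardware: a "configuration" is represented by an ordinary Python function, and the two placeholder facilities stand in for real computing resources.

```python
class RFU:
    """Toy model of a multiple-context reconfigurable functional unit."""

    def __init__(self, num_contexts):
        self.contexts = [None] * num_contexts  # on-chip configuration memory
        self.active = 0                        # index of the active context

    def set_context(self, index, facility):
        """SET CONTEXT: load a configuration into one context."""
        self.contexts[index] = facility

    def activate_context(self, index):
        """ACTIVATE CONTEXT: make an on-chip configuration the active one."""
        self.active = index

    def execute(self, *args):
        """EXECUTE: run whatever is active; nothing is checked here."""
        return self.contexts[self.active](*args)


def facility_a(x, y):  # placeholder computing facility
    return x + y


def facility_b(x, y):  # placeholder computing facility
    return x - y


rfu = RFU(num_contexts=2)
rfu.set_context(0, facility_a)
rfu.set_context(1, facility_b)
rfu.activate_context(0)
first = rfu.execute(7, 3)   # uses facility_a
rfu.activate_context(1)
second = rfu.execute(7, 3)  # uses facility_b
print(first, second)  # 10 4
```

Because `execute` never verifies which configuration is active, forgetting an `activate_context` call silently yields the result of the wrong facility. Managing the active and idle configurations is therefore the job of the calling code.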
The user is given a number of EXECUTE instructions which encompass different
operation patterns: single- or double-slot operations, operations with an immediate
argument, etc. It is the responsibility of the user to choose the appropriate EXECUTE
instruction corresponding to the pattern of the operation to be executed. At the source
code level, this may be done by setting up an alias, as described subsequently. Since
the EXECUTE instructions are executed on the RFU without checking the active
configuration, it is also the responsibility of the user to manage the
active and idle configurations.

¹ Minimum number of clock cycles between the issue of successive operations.
² Clock cycles between the issue of an operation and availability of its results.

[Figure 6: the multiple-context FPGA-based Reconfigurable Functional Unit comprises the Raw
Hardware (configurable computing resources, or facilities) and the Configuration Memory, which
determines the active configuration. The Configuration Unit contains hardwired configuring
resources (facilities). Instruction labels: SET CONTEXT, ACTIVATE CONTEXT, EXECUTE.]
Fig. 6. The organization of the RFU and associated configuration unit.
Since the semantics of an operation performed by a computing facility, its latency,
recovery, and slot assignment are all user definable, the source code of the application
should contain information to augment the Machine Description File [20]. Assuming,
for example, a user-defined VLD instruction, one way to specify such information is to
annotate the source code as follows:
    .alias    VLD EXEC 3   ; specifies the alias EXECUTE 3
                           ; (super-op with two inputs and outputs)
    .latency  VLD 7        ; specifies the VLD latency
    .recovery VLD 7        ; specifies the VLD recovery
    .slot     VLD 1+2      ; specifies the slot assignment
                           ; of the VLD instruction
In a similar way, the user can define as many RFU-related instructions as needed.
The next section will present the syntax and semantics of the 1-D IDCT and VLD
instructions, as well as implementation issues of the corresponding computing facilities.
4 1-D IDCT instruction and computing facility
Since the standard TriMedia provides good support for transposition and matrix
storage, we expect to get little benefit if we configure the entire 2-D IDCT into the FPGA.
Our goal is to balance the cost of storing the intermediate 2-D IDCT results into an
FPGA-resident transpose matrix memory against obtaining free slots in TriMedia.
Consequently, only a super-operation computing the 1-D IDCT of eight 16-bit values
packed in two 64-bit registers is considered. The syntax of such an operation is:
    1-D IDCT Rx, Ry → Rz, Rw
where the registers Rx and Ry specify the inputs, and Rz and Rw the outputs. All
registers Rx, Ry, Rz, and Rw share the common format presented in Table 1.
Table 1. 1-D IDCT – The common format of registers Rx, Ry, Rz, and Rw (vec64sh).
Field name   Acronym   Width (bit)   Position (bit)   Type (TriMedia)   Range   Description
1st value    –         16            63...48          int16             –       –
2nd value    –         16            47...32          int16             –       –
3rd value    –         16            31...16          int16             –       –
4th value    –         16            15...0           int16             –       –
Since there are no dependencies in computing the 1-D IDCT on each row (column)
of the 8 × 8 matrix, a pipelined 1-D IDCT is desirable. A recovery of 1 for such a computing
resource implies that the FPGA clock frequency is equal to the TriMedia clock
frequency. The current TriMedia clock frequency is greater than 200 MHz,
while the maximum allowable clock frequency for ACEX 1K is 180 MHz. Therefore,
a hypothetical 1-D IDCT implementation having a recovery of 1 is not a realistic
scenario, and a recovery of 2 or more is mandatory for the time being. In the sequel, we
will assume a recovery of 2 for the 1-D IDCT and a 200 MHz TriMedia. This implies that
the pipelined implementation of the 1-D IDCT will work at a 100 MHz clock frequency.
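The arithmetic behind the choice of recovery can be written down as a short sketch. The function names are hypothetical; the frequencies are the ones quoted above, with the TriMedia clock taken at the assumed value of 200 MHz.

```python
import math


def min_recovery(cpu_mhz, fpga_max_mhz):
    """Smallest integer recovery for which the FPGA clock stays within its limit."""
    return math.ceil(cpu_mhz / fpga_max_mhz)


def fpga_clock_mhz(cpu_mhz, recovery):
    """FPGA pipeline clock when one operation is accepted every `recovery` cycles."""
    return cpu_mhz / recovery


recovery = min_recovery(200, 180)
print(recovery, fpga_clock_mhz(200, recovery))  # 2 100.0
```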
All the operations required to compute the 1-D IDCT are implemented using 16-bit
fixed-point arithmetic. Since an implementation of the rotator with four multiplications
is preferred [10], the computation of the 1-D IDCT requires 14 multiplications. As all the
multiplications are to be performed in parallel, an efficient implementation of each
multiplication is of crucial importance. For all multiplications, the multiplicand is a 16-bit
signed integer represented in 2's complement notation, while the multiplier is a positive
integer constant of 15 bits or less. As claimed in [21], these word lengths in connection
with fixed-point arithmetic are sufficient to fulfill the IEEE numerical accuracy for
the IDCT in MPEG applications [22].
A general multiplication scheme for which both the multiplicand and multiplier operands
are unknown at implementation time exhibits the largest flexibility at the expense
of higher latency and larger area. If one of the operands is known at implementation
time, the flexibility of the general scheme becomes useless, and a customized
implementation of the scheme will lead to improved latency and area. A scheme which is
optimized for one of the operands is referred to as multiplication-by-constant. Since
such a scheme is more appropriate for our application, we will use it subsequently.
To implement the multiplication-by-constant scheme, we built a partial product
matrix, where only the rows corresponding to a '1' in the multiplier operand are filled in.
Then, reduction schemes which fit into a pipeline stage running at 100 MHz are sought.
It should be emphasized that a reduction algorithm which is optimum on a certain FPGA
family may not be optimum for a different family.
In connection with the partial product matrix, reduction modules which can run at
100 MHz when mapped on an ACEX 1K are presented in Figure 7. All the designs
are synchronous, i.e., both inputs and outputs are registered. The estimations have been
obtained by compiling VHDL source codes with Leonardo Spectrum™ from Exemplar,
followed by a place and route procedure performed by MAX+PLUS II™ from Altera.
The 100 MHz reduction modules are summarized below:
– Horizontal reductions of three or four 16-bit lines to one line (Fig. 7 – a).
– Horizontal reduction of only two 30-bit lines to one line (Fig. 7 – b).
– Vertical reductions of three or four 7-bit columns to one line (Fig. 7 – c).
– Vertical reductions of six 5- or 6-bit columns to one line (Fig. 7 – d).
[Figure 7: four partial-product diagrams, panels (a) to (d), each drawn over bit positions 31 down to 0.]
Fig. 7. 100 MHz reduction modules on ACEX 1K.
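The partial product matrix of a multiplication by constant can be illustrated in software as a set of shifted copies of the multiplicand, one per '1' bit of the constant. This sketch shows only which rows exist; it says nothing about how the rows are reduced in the FPGA. The function names are hypothetical, and the sample constant (the value $\sqrt{1/2}$ scaled by $2^{15}$) is an assumed fixed-point encoding chosen for illustration.

```python
def partial_product_rows(multiplicand, constant):
    """One shifted copy of the multiplicand per '1' bit of the constant."""
    return [multiplicand << bit
            for bit in range(constant.bit_length())
            if (constant >> bit) & 1]


def multiply_by_constant(multiplicand, constant):
    """Sum of the non-empty rows of the partial product matrix."""
    return sum(partial_product_rows(multiplicand, constant))


CONSTANT = round((0.5 ** 0.5) * (1 << 15))  # assumed 15-bit encoding of sqrt(1/2)
rows = partial_product_rows(-1234, CONSTANT)
product = multiply_by_constant(-1234, CONSTANT)
print(len(rows), product == -1234 * CONSTANT)
```

Only as many rows as the constant has '1' bits need to be reduced, which is where the saving over a general multiplier comes from.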
We do not go into details about the implementations of the multipliers and refer
the reader to [10]. We only mention the latency of each multiplier: $\times C'_0$ latency = 2,
$\times C'_1$ latency = 3, $\times S'_1$ latency = 3, $\times C'_3$ latency = 3, $\times S'_3$ latency = 3,
$\times C'_6$ latency = 3, $\times S'_6$ latency = 2.
The sketch of the 1-D IDCT pipeline is depicted in Figure 8 (the Roman numerals
specify the pipeline stages). Considering the critical path, the latency of the 1-D IDCT
is composed of:
– one TriMedia cycle for reading the input operands from the register file into the
input flip-flops of the 1-D IDCT computing resource;
– two FPGA cycles for computing the multiplication by constant $C'_0$;
– one FPGA cycle for computing all additions to rotators $\sqrt{2}C_1$ and $\sqrt{2}C_3$;
– three FPGA cycles for computing the multiplication by constant $C'_1$;
– one FPGA cycle for computing the additions in the last stage of the transform;
– one TriMedia cycle for writing back the results from the output flip-flops of the
1-D IDCT computing resource into the register file.
Since each FPGA cycle lasts two TriMedia cycles, the latency of the 8-point 1-D IDCT
operation is 1 + (2 + 1 + 3 + 1) × 2 + 1 = 16 TriMedia cycles. We evaluated that the
1-D IDCT uses 42% of the logic elements of an ACEX EP1K100 device and 257 I/O pins.
[Figure 8: pipeline diagram. The register file is read (RD, 2x) into the IDCT, which has seven
enabled pipeline stages I to VII (each a multiplication reduction module or a butterfly
addition/subtraction), and the results are written back (WR, 2x) to the register file. Signals:
TriMedia clock, FPGA clock (TriMedia clock divided by 2), FPGA enable, IDCT enable,
write-back enable.]
Fig. 8. The 1-D IDCT pipeline
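The latency budget above can be re-derived with a few lines of Python. The dictionary keys are descriptive labels only. The second figure is a simple issue-rate estimate that assumes sixteen 1-D IDCT operations per 8 × 8 block (eight rows and eight columns) issued back to back. It does not model transposition or any other overhead, so it should be read as an approximation and not as the measured throughput.

```python
RECOVERY = 2  # TriMedia cycles per FPGA cycle, as assumed in the text

fpga_cycles = {
    "multiplication by C'_0": 2,
    "additions feeding the rotators": 1,
    "multiplication by C'_1": 3,
    "additions of the last stage": 1,
}
READ_CYCLES = 1        # register file -> input flip-flops
WRITE_BACK_CYCLES = 1  # output flip-flops -> register file

latency = READ_CYCLES + sum(fpga_cycles.values()) * RECOVERY + WRITE_BACK_CYCLES

ONE_D_IDCTS_PER_BLOCK = 8 + 8  # eight rows and eight columns
issue_cycles_per_block = ONE_D_IDCTS_PER_BLOCK * RECOVERY

print(latency, issue_cycles_per_block)  # 16 32
```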
5 VLD instruction and computing facility
As mentioned in Section 3, computing resources which can perform rather complex
operations are worth implementing on the RFU. Also, as with all hardwired computing
resources, the latency of an RFU-configured computing resource should be known
at compile time. Therefore, we will subsequently consider a VLD instruction which
returns one DCT symbol (run/level pair or end-of-block) per execution. That is, a
constant-output-rate VLD is to be employed. With such a decoder, no benefit from preferentially
decoding the short (most probable) codewords can be achieved.
A super-operation pattern with two input (Rx, Ry) and two output (Rz, Rw) registers
is assigned to the variable-length decoder:
    VLD Rx, Ry → Rz, Rw
Table 2. VLD-1 – The format of the first argument (parameter) register – Rx (uint32).
Field name            Acronym     Width (bit)   Position (bit)   Type (TriMedia)   Range    Description
Decoding parameters   dec_param   32            31...0           uint32            –
Not used              –           27            31...5           n.a.              n.a.
MPEG standard         mpeg_s      1             4                bit               {0,1}    = 1 for MPEG-2
Intra VLC format      ivlc_f      1             3                bit               {0,1}    = 0 for B14 table
Intra/PB              intra_pb    1             2                bit               {0,1}    = 1 for intra macroblock
Luma/Chroma           y_c         1             1                bit               {0,1}    = 1 for luminance
DC/AC Coefficient     dc_ac       1             0                bit               {0,1}    = 1 for DC coefficient
The Rx register specifies the decoding parameters which identify the type of the symbol
to be decoded: AC/DC, luminance/chrominance, intra/non-intra, as well as whether the
string is an MPEG-1 or MPEG-2 one, or whether the decoding table is B14 or B15
[5]. The second register, Ry, contains 64 bits of the VL compressed data. The decoded
symbol and its code length will be stored into registers Rz and Rw, respectively. Since
the VLD does not know the start of the next variable-length codeword until the current
codeword is decoded, a new VLD operation can be launched only after the previous one
has completed. Consequently, a recovery lower than the latency gives no advantage,
and such an implementation should not be sought. The formats of the registers Rx, Ry, Rz,
and Rw are shown in Tables 2, 3, 4, and 5.

Table 3. VLD-1 – The format of the second argument register – Ry (uint64).
Field name    Acronym   Width (bit)   Position (bit)   Type (TriMedia)   Range   Description
MPEG string   –         64            63...0           uint64            n.a.    The first bit of the MPEG string is the most-significant bit

Table 4. VLD-1 – The format of the returned value in register Rz (vec64ub).
Field name    Acronym       Width (bit)   Position (bit)   Type (TriMedia)   Range   Description
Not used      –             32            63...32          any               n.a.
Level         level         16            31...16          int16             –       Extracted as two uint8.
Run           run           8             15...8           uint8             –
Code-length   code_length   8             7...0            uint8             –

Table 5. VLD-1 – The format of the returned value in register Rw (vec64ub).
Field name      Acronym        Width (bit)   Position (bit)   Type (TriMedia)   Range    Description
Not used        –              32            63...32          any               n.a.
Not used        –              8             31...24          uint8             n.a.
Exit controls   –              8             23...16          uint8             –
valid decode    valid_decode   1             19               bit               {0,1}    = 1 when valid decode
error           error          1             18               bit               {0,1}    = 1 when error
EOB             EOB            1             17               bit               {0,1}    = 1 when end-of-block
exit flag       exit_flag_1    1             16               bit               {0,1}    = 1 when exit condition
Not used        –              8             15...8           uint8             n.a.
Exit flag       –              8             7...0            uint8             {0,1}
exit flag       exit           1             1                bit               {0,1}    = 1 exit condition
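The register layouts can be exercised with a short packing and unpacking sketch. The helper names are hypothetical. The bit positions are those given for Rx in Table 2, for Rz in Table 4, and for bits 19 to 16 of Rw in Table 5; the low-order exit flag of Rw is left out of the sketch.

```python
def pack_rx(mpeg_s, ivlc_f, intra_pb, y_c, dc_ac):
    """Build the parameter register Rx from its five one-bit fields."""
    return (mpeg_s << 4) | (ivlc_f << 3) | (intra_pb << 2) | (y_c << 1) | dc_ac


def to_int16(raw):
    """Interpret a 16-bit pattern as a two's complement value."""
    return raw - 0x10000 if raw & 0x8000 else raw


def unpack_rz(rz):
    """Split Rz into level (bits 31...16), run (15...8), code length (7...0)."""
    return {
        "level": to_int16((rz >> 16) & 0xFFFF),
        "run": (rz >> 8) & 0xFF,
        "code_length": rz & 0xFF,
    }


def unpack_rw_controls(rw):
    """Extract the exit-control bits 19...16 of Rw."""
    return {
        "valid_decode": (rw >> 19) & 1,
        "error": (rw >> 18) & 1,
        "eob": (rw >> 17) & 1,
        "exit_flag": (rw >> 16) & 1,
    }


rx = pack_rx(mpeg_s=1, ivlc_f=0, intra_pb=1, y_c=1, dc_ac=0)
rz = ((-3 & 0xFFFF) << 16) | (1 << 8) | 6  # level -3, run 1, 6-bit codeword
print(bin(rx), unpack_rz(rz))
# 0b10110 {'level': -3, 'run': 1, 'code_length': 6}
```

After each decode, software would typically shift the bit string in Ry left by `code_length` before the next VLD operation. That step is an assumption about how the returned length is used and is not shown here.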
Generally speaking, a constant-output-rate VLD computes the codeword length by
looking up the 17 leading bits of the incoming bit stream in a look-up table. The
decoder then sends the code length and the leading bits to other feed-forward circuitry for
further decoding, and immediately shifts the input by a number of bits equal to the code
length, to prepare the next decoding cycle. In cases where the number of codewords is
large, there are some bits that are common to the long VLCs, called prefixes. By exploiting
these common prefixes, the size of the LUT can be reduced because the prefixes are
no longer redundant in the LUT [23,24]. The basic idea of prefix precoding is to group
the VLCs by their common prefixes, and to provide several LUTs, one for each group,
each of which can decode codewords only in the corresponding group.
Since a single EAB of an ACEX 1K device can implement a lookup table of 8 in-
puts, we partitioned the VLC table according to this FPGA architectural characteristic,
as presented in Table 6.
Table 6. The partitioning of the VLC codes of AC coefficients into groups and classes.
Name of        No. of symbols   Class / Leading   Code length        Bypassed       Effective
the group      in the class     bit-sequence                         bit-sequence   address length
DC Group 0     2                1                 1 + s              –              n.a.
End-of-block   1                10                2                  –              n.a.
AC Group 0     2                11                2 + s              –              n.a.
Escape         1                0000 01           6 + 18/(14,22)     –              n.a.
Group 1        2                011               3 + s              0              3
Group 1        4                010               4 + s              0              4
Group 1        4                0011              5 + s              0              5
Group 1        2                0010 1            5 + s              0              5
Group 1        8                0001              6 + s              0              6
Group 1        8                0000 1            7 + s              0              7
Group 1        16               0010 0            8 + s              0              8
Group 2        16               0000 001          10 + s             0000 00        5
Group 2        32               0000 0001         12 + s             0000 00        7
Group 2        32               0000 0000 1       13 + s             0000 00        8
Group 3        32               0000 0000 01      14 + s             0000 0000 0    6
Group 3        32               0000 0000 001     15 + s             0000 0000 0    7
Group 3        32               0000 0000 0001    16 + s             0000 0000 0    8
In order to reduce the latency, the implementation of the VLD makes use of advanced computation. The run and level are decoded for every group in parallel, as if the valid symbol belonged to that group. At the same time, the code length of the symbol and a number of selection signals are determined. Then, the proper run/level pair is selected. The implementation is presented in Figure 9.
For groups 1, 2, and 3, one, six, and nine leading bits, respectively, are shifted out of the original VLC string. The three resulting strings are each sent to a different EAB, and three run/level pairs are generated as if the shifted-out leading bits had been those listed in the Bypassed bit-sequence column of Table 6. By means of combinatorial circuits, the same procedure is carried out for groups 0, end-of-block, and escape.
Each leading bit-sequence that defines a VLC class is decoded by a multiple-input gate. Once the class is detected, a multiplexer selects the proper output from among the outputs of the EABs, the EOB detector, the Escape detector, and the Group 0 decoder. The code length of the decoded symbol is generated according to the detected class.
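The class detection and selection steps can be illustrated with a short sketch built from the AC rows of Table 6. It is a software analogy, not the hardware described here: the names `CLASSES`, `speculative_addresses`, `detect_class`, and `select` are hypothetical, the DC case and the escape payload are not modeled, and the EAB contents (the run/level values) are left out.

```python
# (leading bit-sequence, group, code length excluding sign, has sign bit),
# taken from the AC rows of Table 6.
CLASSES = [
    ("10", "End-of-block", 2, False),
    ("11", "AC Group 0", 2, True),
    ("000001", "Escape", 6, False),  # escape payload bits are not modeled
    ("011", "Group 1", 3, True),
    ("010", "Group 1", 4, True),
    ("0011", "Group 1", 5, True),
    ("00101", "Group 1", 5, True),
    ("0001", "Group 1", 6, True),
    ("00001", "Group 1", 7, True),
    ("00100", "Group 1", 8, True),
    ("0000001", "Group 2", 10, True),
    ("00000001", "Group 2", 12, True),
    ("000000001", "Group 2", 13, True),
    ("0000000001", "Group 3", 14, True),
    ("00000000001", "Group 3", 15, True),
    ("000000000001", "Group 3", 16, True),
]
# Number of leading bits shifted out before addressing each group's EAB.
BYPASSED_BITS = {"Group 1": 1, "Group 2": 6, "Group 3": 9}
EAB_ADDRESS_BITS = 8


def speculative_addresses(vlc):
    """EAB address of every group, computed before the class is known."""
    return {group: vlc[skip:skip + EAB_ADDRESS_BITS]
            for group, skip in BYPASSED_BITS.items()}


def detect_class(vlc):
    """Return (group, total code length), or None if no class matches."""
    for prefix, group, length, has_sign in CLASSES:
        if vlc.startswith(prefix):
            return group, length + (1 if has_sign else 0)
    return None


def select(vlc):
    """Multiplexer step: keep only the address of the detected group."""
    addresses = speculative_addresses(vlc)
    detected = detect_class(vlc)
    if detected is None:
        return None
    group, length = detected
    return group, length, addresses.get(group)
```

For example, `select("01100000000000000")` yields Group 1 with a code length of 4 bits and the 8-bit address formed by bits 1 to 8 of the string. Because the leading bit-sequences are prefix-free, at most one class matches, so the order of the entries in `CLASSES` does not matter.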
By simulation, we found that the FPGA-based VLD operation exhibits a latency of 7 TriMedia cycles. Six EABs of an ACEX EP1K100 device are used.
[Figure 9: block diagram comprising the leading bit-sequence decoder, the code length decoders, the DC and AC Group 0 decoders, the EAB-based run and level decoders of Groups 1, 2, and 3, the MPEG-1/MPEG-2 escape decoder, the intra DC size and code length estimators, the selectors, and the controller generating the Valid_decode, End-of-Block, Error, and Exit_flag signals.]
Fig. 9. The VLD implementation on FPGA.
6 8 × 8 IDCT
The functionality of the 8 × 8 IDCT can be implemented in both software and reconfigurable hardware. We will evaluate their performance subsequently.
6.1 8 × 8 IDCT implementation on standard TriMedia
In the current implementation of the 2-D IDCT on the standard TriMedia/CPU64 architecture, all computations are done with 16-bit values, and make intensive use of SIMD-style operations. The 8 × 8 matrix is stored in sixteen 64-bit words, each containing a half row of four 16-bit elements. Therefore, four 16-bit elements can be processed in parallel by a single word-wide operation. In addition, being a 5-issue-slot VLIW processor, TriMedia/CPU64 can execute 5 such operations per clock cycle.
This strategy is used for both the horizontal and vertical IDCTs. First, eight 1-D IDCTs (two SIMD 1-D IDCTs) are computed using the modified 'Loeffler' algorithm [9]. Then, the 8 × 8 matrix is transposed by TRANSPOSE double-slot operations. Such a unit can generate the upper or the lower two words of a transposed 4 × 4 matrix in one cycle. Therefore, the 8 × 8 matrix transpose is computed in eight basic operations. Finally, eight 1-D IDCTs (two SIMD 1-D IDCTs) are computed, having the results generated by the transposition as inputs. Following the described procedure, a complete 2-D IDCT including all overheads (mostly composed of load and store operations) can be performed in 56 cycles [18].
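The row-column decomposition used above (eight 1-D IDCTs, a transposition, eight more 1-D IDCTs) can be sketched in a few lines. This is a floating-point reference sketch, not the 16-bit SIMD code discussed in the text: `idct_1d` evaluates the orthonormal inverse DCT formula directly instead of the modified Loeffler flow graph, and the function names are ours.

```python
import math

N = 8


def idct_1d(coeffs):
    """Direct-form 1-D inverse DCT of N coefficients (orthonormal scaling)."""
    out = []
    for x in range(N):
        total = 0.0
        for u in range(N):
            scale = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
            angle = (2 * x + 1) * u * math.pi / (2 * N)
            total += scale * coeffs[u] * math.cos(angle)
        out.append(total)
    return out


def transpose(matrix):
    """Swap rows and columns of a square matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def idct_2d(block):
    """2-D IDCT as 1-D IDCTs on rows, a transposition, and 1-D IDCTs again."""
    rows = [idct_1d(row) for row in block]              # first eight 1-D IDCTs
    cols = [idct_1d(row) for row in transpose(rows)]    # second eight 1-D IDCTs
    # After the second pass the result is the transpose of the 2-D IDCT;
    # this sketch transposes once more to return it in row-major order.
    return transpose(cols)
```

A block whose only nonzero coefficient is a DC value of 8 is transformed into a flat block of ones, which is a quick sanity check of the scaling.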
6.2 8 × 8 IDCT implementation on extended TriMedia
As described in Section 4, a super-operation which can compute the 1-D IDCT on eight 16-bit values represented as two 64-bit words is available in the extended TriMedia. The 1-D IDCT operation has a latency of 16, a recovery of 2, and can be issued on the slot pair 1+2. To calculate the 2-D IDCT, eight 1-D IDCTs are first computed. Then, eight TRANSPOSE super-operations are scheduled on the slot pairs 1+2 or 3+4 to transpose the 8 × 8 matrix. Finally, eight 1-D IDCTs complete the 2-D IDCT. Before and after each 2-D IDCT, LOAD and STORE operations fetch the input operands from main memory into the register file, and store the results back into memory, respectively.
In order to keep the pipeline full, back-to-back 1-D IDCT operation is needed. That is, a new 1-D IDCT instruction has to be issued every two cycles. Since true dependencies prevent the last eight 1-D IDCTs of a 2-D IDCT from being issued back-to-back with the first eight, the 2-D IDCTs are processed in chunks of two, in an interleaved fashion. A number of 2 × 16 = 32 registers are needed for this interleaved processing pattern. The code was manually scheduled. We found that the 2-D IDCT exhibits a throughput of 1/32 IDCT/cycle and a latency of 42 cycles [10].
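The benefit of interleaving two 2-D IDCTs can be checked with a simplified issue model. The sketch below is an assumption-laden illustration, not the manual schedule of [10]: it counts only the 1-D IDCT issue slots, uses the latency of 16 and recovery of 2 stated above, and ignores the TRANSPOSE, LOAD, and STORE operations. The function name `interleaved_period` is ours.

```python
LATENCY = 16       # cycles until a 1-D IDCT result is available
RECOVERY = 2       # minimum distance between two 1-D IDCT issues
OPS_PER_PASS = 8   # 1-D IDCTs per horizontal or vertical pass


def interleaved_period(num_blocks):
    """Return (cycles used, stall cycles) when the first passes of all
    blocks are issued back-to-back, followed by their second passes."""
    cycle = 0
    ready = []
    for _ in range(num_blocks):
        # The last 1-D IDCT of this pass issues at cycle + RECOVERY * 7.
        last_issue = cycle + RECOVERY * (OPS_PER_PASS - 1)
        ready.append(last_issue + LATENCY)
        cycle += RECOVERY * OPS_PER_PASS
    stalls = 0
    for block in range(num_blocks):
        # The second pass needs every result of the block's first pass.
        if cycle < ready[block]:
            stalls += ready[block] - cycle
            cycle = ready[block]
        cycle += RECOVERY * OPS_PER_PASS
    return cycle, stalls
```

Under these assumptions a single 2-D IDCT stalls for 14 cycles between its two passes, whereas two interleaved 2-D IDCTs occupy 64 issue cycles without stalls, i.e., 2/64 = 1/32 IDCT/cycle, which agrees with the reported throughput.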
7 Entropy decoder
The functionality of the entropy decoder can be implemented in both software and reconfigurable hardware. We will evaluate their performance subsequently.
[Figure 10: flowchart. Activate VLD context; VLC decoding, matrix reconstruction, inverse Q, and DC prediction, repeated while there are more macroblocks in the current slice; activate IDCT context; 8x8 IDCT, repeated while there are more blocks in the current slice.]
Fig. 10. The computing scenario of the macroblock parsing and pel reconstruction routine.
7.1 Entropy decoder implementation on standard TriMedia
The implementation of the entropy decoder on the standard TriMedia is a modified version of that proposed in [25]. The VLD has a variable input-output rate, being implemented as a repeated table lookup. Each lookup decodes a chunk of bits (8 bits at the first-level lookup), and determines whether a valid code was encountered. In case of a valid decode, a run-level pair is generated, or an escape or end-of-block flag is set. If a miss is detected, an offset into the VLC table and a chunk size for a second-level lookup are generated. This process of signaling an incomplete decode and generating a new offset may be repeated three times. When a valid symbol has been encountered, it is stored into the 8 × 8 matrix at the location defined by the run value. After compiling and scheduling the C code, we found that a table lookup takes 21 cycles. Consequently, the entropy decoding of a single DCT coefficient can take between 21 and 63 cycles. The size of all lookup tables is 10 KB.
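The repeated table lookup can be sketched as follows. The sketch is illustrative only: `TOY_TABLES` is a hypothetical three-level table over 2-bit chunks (the text uses 8 bits at the first level and a chunk size supplied by each miss), the symbols A to F are placeholders for run-level pairs, and only the 21-cycle cost per lookup is taken from the text.

```python
CYCLES_PER_LOOKUP = 21  # per-lookup cost reported in the text

# Hypothetical toy tables with 2-bit chunks.
# An entry is ("hit", symbol, bits used in this chunk) or ("miss", next table).
TOY_TABLES = {
    "level1": {"10": ("hit", "A", 1), "11": ("hit", "A", 1),
               "01": ("hit", "B", 2), "00": ("miss", "level2")},
    "level2": {"10": ("hit", "C", 1), "11": ("hit", "C", 1),
               "01": ("hit", "D", 2), "00": ("miss", "level3")},
    "level3": {"10": ("hit", "E", 1), "11": ("hit", "E", 1),
               "01": ("hit", "F", 2), "00": ("hit", "error", 2)},
}
CHUNK = 2


def decode_symbol(bits, tables=TOY_TABLES, chunk=CHUNK):
    """Return (symbol, bits consumed, cycles) for the first codeword."""
    table, pos, lookups = "level1", 0, 0
    while True:
        # Pad with zeros so that a short tail still forms a full chunk.
        key = bits[pos:pos + chunk].ljust(chunk, "0")
        entry = tables[table][key]
        lookups += 1
        if entry[0] == "hit":
            return entry[1], pos + entry[2], lookups * CYCLES_PER_LOOKUP
        table = entry[1]   # incomplete decode: continue in the next table
        pos += chunk
```

With three table levels, a symbol costs one, two, or three lookups, i.e., 21, 42, or 63 cycles, which matches the range quoted above.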
7.2 Entropy decoder implementation on extended TriMedia
The entropy decoder on the extended TriMedia benefits from reconfigurable hardware support. By employing software pipelining techniques, useful computations related to run-length decoding may be performed in the delay slots of the VLD operation. That is, the empty 8 × 8 matrix is successively filled in with level values at the positions specified by the run values. In this way, a symbol is processed completely in one (fixed-latency) iteration. By simulation, we found that a single DCT coefficient can be decoded in 11 cycles including all overheads.
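The run-length step performed in the delay slots can be sketched as below. This is a minimal sketch under assumptions: the block is kept as a flat list of 64 entries in decoding order, the scan-order mapping to rows and columns is not modeled, and the name `fill_block` is ours.

```python
def fill_block(run_level_pairs, size=64):
    """Place each level after skipping `run` zero coefficients."""
    block = [0] * size   # empty 8 x 8 matrix, stored as 64 entries
    pos = 0
    for run, level in run_level_pairs:
        pos += run       # skip the zero coefficients announced by the run
        if pos >= size:
            raise ValueError("run exceeds block size")
        block[pos] = level
        pos += 1         # the next run is counted from the following entry
    return block
```

For instance, the pairs (0, 5), (2, -1), (1, 3) place the levels at positions 0, 3, and 5 and leave every other entry at zero.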
8 Experimental results
In order to determine the potential impact on performance provided by the multiple-context reconfigurable core, we will consider a benchmark which consists of macroblock parsing followed by pel reconstruction procedures. Therefore, we operate at MPEG slice level, i.e., the data elements of the slice layer and the layers above it are assumed to be constant. The computing scenario is presented in Figure 10. First, a variable-length decoding of a macroblock (header and DCT coefficients extraction) is performed. Then, the 8 × 8 matrices are recreated, and inverse quantization, followed by DC coefficient prediction for intra-coded macroblocks, is carried out. After all macroblocks in a slice have been decoded, a burst of 2-D IDCTs is launched in order to reconstruct the initial pels. During computation, the 1-D IDCT and VLD computing resources are activated by an ACTIVATE CONTEXT, as needed.
All the contexts of the RFU are to be configured at application load time, i.e., a number of SET CONTEXT instructions are scheduled at the top of the program code. A sample of the code using the instructions of the architectural extension is presented subsequently. As can be observed, the VLD and IDCT exhibit the same execution pattern: two inputs and two outputs.
.alias VLD EXEC 3          ; alias of the VLD instruction
.alias IDCT EXEC 3         ; alias of the IDCT instruction
SET CONTEXT VLD            ; load context VLD
SET CONTEXT IDCT           ; load context IDCT
...
ACTIVATE CONTEXT VLD       ; configure VLD resource
...
VLD Rx, Ry -> Rz, Rw       ; execute VLD
...
ACTIVATE CONTEXT IDCT      ; configure IDCT resource
...
IDCT Rx, Ry -> Rz, Rw      ; execute IDCT
...
Therefore, our experiment includes two approaches: pure software and FPGA-based. As mentioned, in the pure software approach a DCT coefficient is decoded in 21-63 cycles, and a 2-D IDCT can be computed in 56 cycles. In the FPGA-based approach, a DCT coefficient is decoded in 11 cycles, and the 2-D IDCT is carried out with a throughput of 1/32 IDCT/cycle. Based on the published work in the field of multiple-context FPGAs [16], we make a conservative assumption and consider that the context switching penalty is 10 cycles.
8.1 Pel reconstruction performance evaluation
An MPEG-compliant program has been written in C, then compiled and scheduled with the TriMedia development tools. The performance evaluation has been done assuming that, despite the large lookup tables stored in memory, the standard TriMedia/CPU64 never incurs a cache miss. In other words, we compare an 'ideal-cache' standard TriMedia with a multiple-context FPGA-augmented TriMedia.
Subsequently, we present the results according to two scenarios: worst-case (considered from our point of view) and average-case. In both cases we assumed that an average of 5 coefficients per block are decoded. In the worst-case scenario, we assumed that all DCT coefficients produce a hit on the first-level lookup when the pure software implementation is used. In the same worst-case scenario, we also assumed that the overhead introduced by parsing the macroblock headers has the largest value (for example, the quantization value is assumed to be updated every macroblock). Since the worst-case scenario corresponds to long variable-length codes, it is statistically not relevant. Therefore, we also evaluated the performance in an average-case scenario. In this scenario, we assumed that two of five DCT coefficients produce a miss at the first lookup. Also, we weighted the overhead introduced by parsing the macroblock header with the transmission probability of the different decoding parameters of the macroblock layer. The results are presented in Table 7. The numbers indicate the improvement we obtain in the number of cycles.
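The coefficient-decoding part of the two scenarios can be worked out from the figures given so far. The sketch below rests on an assumption that is ours, not the paper's: every first-level miss is resolved by exactly one additional lookup. It covers only the decoding of the DCT coefficients of one block and leaves out header parsing, inverse quantization, DC prediction, the IDCT, and context switching, so its output is not comparable with the percentages of Table 7.

```python
SW_LOOKUP = 21         # cycles per table lookup, pure software
FPGA_COEFF = 11        # cycles per coefficient with the VLD on the RFU
COEFFS_PER_BLOCK = 5   # average number of coefficients assumed per block


def software_cycles(first_level_misses):
    """Assumes each first-level miss needs exactly one more lookup."""
    hits = COEFFS_PER_BLOCK - first_level_misses
    return hits * SW_LOOKUP + first_level_misses * 2 * SW_LOOKUP


def fpga_cycles():
    """Fixed-latency decoding: the same cost for every coefficient."""
    return COEFFS_PER_BLOCK * FPGA_COEFF
```

Under this assumption, the pure software decoding of one block takes 105 cycles when all five coefficients hit at the first level and 147 cycles when two of the five miss, against 55 cycles for the FPGA-based decoding.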
Table 7. Performance improvement of multiple-context FPGA-augmented TriMedia/CPU64 over 'ideal-cache' (standard) TriMedia/CPU64 for a macroblock parsing followed by pel reconstruction application.
                                                               Worst-case   Average-case
                                                               scenario     scenario
Intra-coded macroblocks                        prior to IDCT   15%          25%
                                               after IDCT      19%          29%
P-coded macroblocks (1 block / macroblock)     prior to IDCT   10%          21%
                                               after IDCT      14%          25%
P-coded macroblocks (3 blocks / macroblock)    prior to IDCT   13%          24%
                                               after IDCT      18%          27%
B-coded macroblocks (1 block / macroblock)     prior to IDCT   8%           17%
                                               after IDCT      12%          20%
B-coded macroblocks (3 blocks / macroblock)    prior to IDCT   11%          22%
                                               after IDCT      17%          25%
Finally, we proceeded to a global evaluation of the performance improvement. For an MPEG stream with 10% intra-coded, 70% B-coded, and 20% P-coded macroblocks, the improvement for the augmented TriMedia is 20-25% in the average-case scenario.
9 Conclusions
We have proposed an architectural extension for TriMedia/CPU64 which encompasses a multiple-context FPGA-based reconfigurable functional unit and the associated instructions. On the augmented TriMedia/CPU64, we estimated a performance improvement of 20-25% over a standard TriMedia/CPU64 for a macroblock parsing followed by a pel reconstruction application, at the expense of three new instructions: SET CONTEXT, ACTIVATE CONTEXT, and EXECUTE. As future work, we intend to consider the motion compensation and to evaluate the performance improvement for a complete MPEG decoder.
References
1. Razdan, R., Smith, M.D.: A High Performance Microarchitecture with Hardware-
Programmable Functional Units. In: 27th Annual Intl. Symposium on Microarchitecture
– MICRO-27, San Jose, California, (1994) 172–180.
2. Wittig, R.D., Chow, P.: OneChip: An FPGA Processor With Reconﬁgurable Logic. In: IEEE
Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, (1996)
126–135.
3. Hauser, J.R., Wawrzynek, J.: Garp: A MIPS Processor with a Reconfigurable Coprocessor. In: IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, (1997) 12–21.
4. Kastrup, B., Bink, A., Hoogerbrugge, J.: ConCISe: A Compiler-Driven CPLD-Based Instruction Set Accelerator. In: IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, (1999) 92–100.
5. Mitchell, J.L., Pennebaker, W.B., Fogg, C.E., LeGall, D.J.: MPEG Video Compression Standard. Chapman & Hall, New York, New York (1996).
6. Sun, M.T.: Design of High-Throughput Entropy Codec. In: VLSI Implementations for Image Communications. Volume 2. Elsevier Science Publishers B.V., Amsterdam, The Netherlands (1993) 345–364.
7. Rao, K.R., Yip, P.: Discrete Cosine Transform. Algorithms, Advantages, Applications. Aca-
demic Press, San Diego, California (1990).
8. Loefﬂer, C., Ligtenberg, A., Moschytz, G.S.: Practical Fast 1-D DCT Algorithms with 11
Multiplications. In: Intl. Conference on Acoustics, Speech, and Signal Processing (ICASSP
’89), (1989) 988–991.
9. van Eijndhoven, J., Sijstermans, F.: Data Processing Device and method of Computing the
Cosine Transform of a Matrix. PCT Patent No. WO 9948025 (1999).
10. Sima, M., Cotofana, S., van Eijndhoven, J.T., Vassiliadis, S., Vissers, K.: 8 × 8 IDCT Implementation on an FPGA-augmented TriMedia. In: IEEE Symposium on FPGAs for Custom Computing Machines, Rohnert Park, California, (2001).
11. Mukherjee, A., Ranganathan, N., Bassiouni, M.: Efficient VLSI Design for Data Transformation of Tree-Based Codes. IEEE Transactions on Circuits and Systems 38 (1991) 306–314.
12. Kinouchi, S., Sawada, A.: Variable Length Code Decoder. U.S. Patent No. 6,069,575 (2000).
13. Lei, S.M., Sun, M.T.: An Entropy Coding System for Digital HDTV Applications. IEEE
Transactions on Circuits and Systems for Video Technology 1 (1991) 147–155.
14. Brown, S., Rose, J.: Architecture of FPGAs and CPLDs: A Tutorial. IEEE Transactions on
Design and Test of Computers 13 (1996) 42–57.
15. DeHon, A., T. Knight, J., Tau, E., Bolotski, M., Eslick, I., Chen, D., Brown, J.: Dynamically
Programmable Gate Array with Multiple Context. U.S. Patent No. 5,742,180 (1998).
16. Trimberger, S., Carberry, D., Johnson, A., Wong, J.: A Time-Multiplexed FPGA. In: IEEE
Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, (1997)
22–28.
17. ***: ACEX 1K Programmable Logic Family. Altera Datasheet, San Jose, California (2000).
18. van Eijndhoven, J.T.J., Sijstermans, F.W., Vissers, K.A., Pol, E.J.D., Tromp, M.J.A., Struik,
P., Bloks, R.H.J., van der Wolf, P., Pimentel, A.D., Vranken, H.P.E.: TriMedia CPU64 Ar-
chitecture. In: Intl. Conference on Computer Design, Austin, Texas, (1999) 586–592.
19. Sima, M., Vassiliadis, S., Cotofana, S., van Eijndhoven, J.T., Vissers, K.: A Taxonomy
of Custom Computing Machines. In: First PROGRESS Workshop on Embedded Systems,
Utrecht, The Netherlands, (2000) 87–93.
20. Pol, E.J.D., Aarts, B.J.M., van Eijndhoven, J.T.J., Struik, P., Sijstermans, F.W., Tromp,
M.J.A., van de Waerdt, J.W., van der Wolf, P.: TriMedia CPU64 Application Development
Environment. In: Intl. Conference on Computer Design, Austin, Texas, (1999) 593–598.
21. van Eijndhoven, J.: 16-bit compliant software IDCT on TriMedia/CPU64. Internal Report,
Philips Research Laboratories (1997).
22. ***: IEEE Standard Specifications for the Implementations of 8 × 8 Inverse Discrete Cosine Transform. IEEE Std 1180-1990 (1991).
23. Choi, S.B., Lee, M.H.: High Speed Pattern Matching for a Fast Huffman Decoder. IEEE
Transactions on Consumer Electronics 41 (1995) 97–103.
24. Min, K.-y., Chong, J.-w.: A Memory-Efﬁcient VLC decoder Architecture for MPEG-2 Ap-
plication. In: IEEE Workshop on Signal Processing Systems, Lafayette, Louisiana, (2000)
43–49.
25. Pol, E.J.D.: VLD Performance on TriMedia/CPU64. Internal Report, Philips Research Laboratories (2000).
Several considerations about the latency of an RFU-configured computing resource are worth mentioning. Due to realization constraints, the RFU is likely to be located far away from the Register File (RF) in the floorplan of the TriMedia/CPU64. The immediate effect is that there will be large delays in transferring data between the RFU and the RF, and the RFU will not benefit from the bypassing capabilities of the RF [18]. Consequently, read and write-back cycles have to be provided explicitly. In such circumstances, the minimum latency of an RFU-based computing resource includes at least 1 cycle for reading the input arguments from the register file, the absolute minimum combinatorial delay on the FPGA, and 1 cycle for writing back the results to the register file. Assuming that the FPGA clock frequency is equal to half of the TriMedia clock frequency [10], one FPGA cycle spans 2 TriMedia cycles, so the absolute minimum RFU latency is 1 + 2 + 1 = 4 TriMedia cycles. Since a call of an RFU is quite expensive, it is a good idea to minimize the number of RFU calls, i.e., computing resources which can perform complex operations have to be configured on the RFU.
Constraints and freedoms in configuring a VLD computing resource (on FPGA ??):
– The latency of such a computing resource should be known at compile time. Therefore, no benefit can be obtained from preferentially decoding the short (highly probable) codewords.
– The latency of such a computing resource should be as small as possible, as this is the only way to speed up the decoding process. Pipelining is of no use here (see the articles on feedback systems, which do not lend themselves to pipelining).
– There are 12 EABs (256 × 16 words) on an EP1K100 (???). Therefore, the prefix methodology and, consequently, the partitioning of the VLC tables should be performed according to this FPGA architectural characteristic.
Generally speaking, a constant-output-rate VLD computes the codeword length by comparing the leading bits of the incoming bit stream against a small table. The decoder then sends the code length and the leading bits to other feed-forward circuitry for further decoding, and immediately shifts the input by a number of bits equal to the code length, to move to the new leading bits of the input bit stream for decoding the next codeword.
The critical path within the system is always the feedback path, because the other feed-forward paths can be pipelined. That is, the processing speed is limited by the feedback computation time: the time for comparing and selecting the codeword length plus the time for shifting the input [p. 198, Lin & Messerschmitt, part II]. The latency of computing the feedback value sets the decoding cycle time, and is thus inversely proportional to the decoding rate.
The performance metric is throughput, i.e., the net decoder information rate. This rate equals the number of bits or codewords decoded per cycle multiplied by the clock rate. There is a trade-off between these two terms: the more bits or codewords we try to decode in one cycle, the more complicated the PLA (look-up table!) becomes and the slower the clock rate is. Pay attention! TriMedia has a fixed clock rate; the clock frequency is a constraint, i.e., an input datum to the design of an MPEG decoder.
In cases where the number of codewords in the table is large, there are some bits that are common to the long VLCs, which we call a prefix. By exploiting these common prefixes, the size of the LUT can be reduced. A number of schemes such as prefix precoding [Choi Lee], [Min Chong] and table partitioning [Cho Xanthopoulos...] have been presented ...
For FPGA-2002: advanced computation of the next code length. The selection of the proper result is performed simultaneously with the selection of the proper run and length of the current word. Also, in parallel, computing the run and length of the previous codeword is carried out.