Towards an optimised VLSI design algorithm for the constant matrix multiplication problem by Kinane, Andrew et al.
Towards an Optimised VLSI Design Algorithm for
the Constant Matrix Multiplication Problem
Andrew Kinane, Valentin Muresan and Noel O'Connor
Centre for Digital Video Processing, Dublin City University, Dublin 9, IRELAND
Email: kinanea@eeng.dcu.ie
Abstract - The efficient design of multiplierless implementations of constant matrix multipliers is challenged by the huge solution search spaces even for small scale problems. Previous approaches tend to use hill-climbing algorithms, risking sub-optimal results. The proposed algorithm avoids this by exploring parallel solutions. The computational complexity is tackled by modelling the problem in a format amenable to genetic programming and hardware acceleration. Results show an improvement on state of the art algorithms with future potential for even greater savings.

I. INTRODUCTION
Applications involving the multiplication of variable data by constant values are prevalent throughout signal processing. Some common tasks that involve these operations are Finite Impulse Response filters (FIRs), the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT). Optimisation of this kind of constant multiplication will significantly impact the performance of such tasks and the global system that uses them. The examples listed are instances of a more generalised problem - that of a linear transform involving a constant matrix multiplication (CMM). The problem is summarised as follows: substitute all multiplications by constants with a minimum number of shifts and additions/subtractions (we refer to both as "additions") [1]. The optimisation criterion may be extended beyond adder count to include factors like routability, glitching etc. but is restricted to adder count in this paper.

II. PROBLEM STATEMENT
A CMM equation $y = Ax$ (where $y$, $x$ are N-point 1D data vectors and $A$ is an N x N matrix of M-bit fixed-point constants) may be thought of as a collection of N dot products with each dot product $y_i$ expressed as follows:
$$y_i = \sum_{j=0}^{N-1} a_{ij} x_j, \quad i = 0, \ldots, N-1 \tag{1}$$
Each constant may be represented in signed digit (SD) form:
$$a_{ij} = \sum_{k=0}^{M-1} b_{ijk} 2^k, \quad b_{ijk} \in \{\bar{1}, 0, 1\}, \quad \bar{1} = -1 \tag{2}$$
Combining (1) and (2) yields a multiplierless dot product implementation requiring only adders and shifters:
$$y_i = \sum_{j=0}^{N-1} \sum_{k=0}^{M-1} b_{ijk} 2^k x_j, \quad i = 0, \ldots, N-1 \tag{3}$$
The goal is to find the optimal sub-expressions across all N dot products in (3) that lead to the fewest adder resources needed. Three properties aid the classification of approaches: SD permutation, pattern search and synthesis.
1) SD Permutation: Consider that each of the N x N M-bit fixed point constants $a_{ij}$ has a finite set of possible SD representations. For example with M = 4 the constant $(-3)_{10}$ can be represented as either $(00\bar{1}\bar{1})_2$, $(0\bar{1}01)_2$, $(\bar{1}101)_2$, $(0\bar{1}1\bar{1})_2$ or $(\bar{1}11\bar{1})_2$. For the optimal number of adders to be found, all SD representations of the $a_{ij}$ should be considered since for a CMM problem Canonic Signed Digit (CSD) representation is not guaranteed to be optimal (as shown in Section V). The difficulty is that the solution space is very large, hence SD permutation has thus far only been applied to simpler problems [2]. Potkonjak et al. acknowledge the potential of SD permutation but choose a single SD representation for each $a_{ij}$ using a greedy heuristic. Neither of the recent CMM-specific algorithms applies SD permutation [3], [4].
2) Pattern Search: The pattern search goal is to find the sub-expressions in the 3D bit matrix $b_{ijk}$ resulting in fewest adders. Usually $b_{ijk}$ is divided into N 2D slices along the i plane (i.e. taking each CMM dot product in isolation). Patterns are searched for in the 2D slices independently before combining the results for 3D. An example 2D slice is shown in (4), a 4-point dot product with random 12-bit SD constants.
[Equation (4): $y_i$ written as a row vector of powers of two (the bit weights) times a 12 x 4 matrix of SD digits, the 2D slice of $b_{ijk}$, times the input vector $(x_0, x_1, x_2, x_3)^T$. The individual digit entries are not legible in the source.] (4)
Algorithms may search for horizontal/vertical patterns (P1D) or diagonal patterns (P2D) in the 2D slice. The P1D strategy infers a two-layer architecture of a network of adders (with no shifting of addends) to generate distributed weights for each row followed by a fast partial product summation tree (PPST)
0-7803-9390-2/06/$20.00 ©2006 IEEE 5111 ISCAS 2006
Fig. 1. P1D Architecture.
Fig. 2. P2D Architecture.
Fig. 3. CMML Divide and Conquer. (The matrix is drawn as N rows by N/r columns of chunks; each column is a CMM sub-problem.)
Fig. 4. Summary of the CMM Optimisation Algorithm. (Three stages are shown. Unique fixed-point constant permutation evaluation: a set of files, one for each unique constant in the matrix, stores the signed digit permutations for that constant. Dot Product Level (DPL), with parallel processing of the N dot products: loops over all permutations and outputs sub-problem permutations ordered in terms of the number of adders required for each dot product. CMM Level (CMML): merges the DPL results, either exhaustively (Option 1) or with a genetic algorithm (Option 2), and outputs permutations ordered in terms of the number of adders required for the entire CMM. The ordered search space increases reduction potential, and a pipelined implementation facilitates dynamic feedback to the DPL stage for complex CMM.)
to carry out the shift accumulate (Fig. 1). The P2D strategy infers a one-layer architecture (Fig. 2) of a network of adders that in general may have shifted addends (essentially merging the two layers of the P1D strategy). Potkonjak et al. use the P1D strategy and search for horizontal patterns while others use the P2D strategy [3], [4]. However, these proposals select sub-expressions iteratively based on some heuristic criteria that may preclude an optimal realisation of the global problem. This is because the order of sub-expression elimination affects the results [5].
3) Synthesis: As in any hardware optimisation problem, synthesis issues should be considered when choosing sub-expressions for an N-point dot product (a 2D slice). If N is large (e.g. 1024-point FFT) then poor layout regularity may result from complex wiring of sub-expressions from taps large distances apart in the data vector. Indeed a recent paper has shown that choosing such sub-expressions can result in a speed reduction and greater power consumption [6]. It is therefore sensible to divide each N-point dot product into N/r r-point chunks and optimise each sub dot product independently. The CMM problem hence becomes N/r independent sub-problems, each with N dot products of length r (Fig. 3). The optimal choice of r is problem dependent, but the proposed algorithm currently uses r = 4 for reasons outlined subsequently.

III. PROPOSED EFFICIENT MODELLING SOLUTION
The CMM problem is a difficult discrete combinatorial problem and currently requires a shift to a higher class of algorithms for more robust near-optimal solutions. This is because the current approaches are greedy hill-climbing algorithms and the associated results are very problem dependent [5]. The challenge is in the modelling of the problem to make it amenable to efficient computation. The algorithm proposed here models the problem in such a way as to make it amenable to so-called near-optimal algorithms (genetic algorithms (GAs), simulated annealing, tabu-search) and also hardware acceleration. The proposed approach incorporates SD permutation of the matrix constants and avoids hill-climbing by evaluating parallel solutions for each permutation. Such an approach is computationally demanding but the algorithm has been modelled with this in mind and incorporates innovative fast search techniques to reduce this burden.
The proposed algorithm permutes the SD representations of the constants in A. For each permutation, parallel solution options are built based on different sub-expression choices. These parallel implementations are expressed as a sum of products (SOP), where each product term in the SOP represents a particular solution (with an associated adder count). The SD permutation is done on each CMM dot product in isolation (Section IV-A), and the results are subsequently combined (Section IV-B). The algorithm searches for the combined SOP that represents the overall best (in terms of adder count) sub-expression configuration to implement the CMM equation. Previous approaches derive one implementation option (akin to a single term SOP) whereas the proposed approach derives parallel implementations (a multi-term SOP). It is this multi-term SOP approach and its manipulation (Section IV) that make the algorithm suitable for GAs and hardware acceleration.
The proposed algorithm currently uses the P1D strategy, so it searches for horizontal sub-expression patterns of {±1} digits in a 2D slice. The proposed SOP modelling idea can be extended to cover the P2D strategy by simply extending the digit set from {±1} to {±1, ±2, ±3, ±4, ...}.

IV. THE PROPOSED CMM OPTIMISATION ALGORITHM
The proposed approach is a three stage algorithm as summarised in Fig. 4. Firstly all SD representations of the M-bit fixed point constants are evaluated using an M-bit radix-2 SD counter. Then, each dot product in the CMM is processed independently by the dot product level (DPL) algorithm. Finally the DPL results are merged by the CMM level (CMML) algorithm. The three steps may execute in a pipelined manner with dynamic feedback between stages. This offers search space reduction potential as outlined subsequently.
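The first stage, evaluating all SD representations of a constant, can be illustrated with a short sketch. The Python below is not the authors' SD counter: it assumes a brute-force count through every M-digit vector over {-1, 0, 1} and keeps those whose weighted sum equals the constant. The function name `sd_representations` is hypothetical.

```python
from itertools import product


def sd_representations(value, m):
    """Return all m-digit radix-2 signed-digit vectors (MSB first) equal to value.

    Brute-force illustration only: every digit vector over {-1, 0, 1} is
    visited, so the cost grows as 3**m.
    """
    reps = []
    for digits in product((-1, 0, 1), repeat=m):
        # digits[0] is the MSB with weight 2**(m-1); digits[-1] is the LSB.
        total = sum(d * 2 ** (m - 1 - k) for k, d in enumerate(digits))
        if total == value:
            reps.append(digits)
    return reps


reps = sd_representations(-3, 4)
for rep in reps:
    print(rep)
print(len(reps), "representations")
```

For M = 4 and the constant -3 this finds five digit vectors, in line with the example in Section II. The exponential growth of the loop hints at why the paper treats the permutation space as the central difficulty.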
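The multi-term SOP model of Section III lends itself to a compact bit-vector sketch. The Python below is an illustration under stated assumptions, not the authors' implementation: each product term is held as an integer bitmask (standing in for the p_vec bit vector of Section IV-A), ANDing two SOPs is taken to mean ORing the bitmasks of every pair of terms, and the adder count hw is the number of set bits. The helper names `term`, `and_sops` and `hw` are hypothetical; the adder indices are those printed in the row SOPs of (5).

```python
from functools import reduce


def term(*adders):
    """Bitmask with one set bit per adder index (a stand-in for p_vec)."""
    mask = 0
    for a in adders:
        mask |= 1 << a
    return mask


def and_sops(sop_a, sop_b):
    """AND two SOPs: OR the bitmasks of every pair of terms, dropping duplicates."""
    return {ta | tb for ta in sop_a for tb in sop_b}


def hw(mask):
    """Hamming weight of a bitmask, i.e. the number of distinct adders."""
    return bin(mask).count("1")


# Row SOPs as printed in (5); each row SOP is a set of alternative terms.
rows = [
    {term(11)},
    {term(6)},
    {term(3, 51), term(10, 52), term(11, 53)},
    {term(10)},
    {term(0)},
]
permutation_sop = reduce(and_sops, rows)

# List terms in increasing order of adder count (ties broken by mask value).
for mask in sorted(permutation_sop, key=lambda m: (hw(m), m)):
    print(hw(mask), [i for i in range(180) if (mask >> i) & 1])
```

Multiplying out the five rows gives three terms, one with hw = 6 and two with hw = 5, which agrees with the discussion of (6). Repeated adders collapse automatically because ORing a bit that is already set leaves the mask unchanged.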
A. Dot Product Level (DPL) Stage
The DPL algorithm iteratively builds a SOP, and the final SOP terms are the unique sub-expression selection options after considering all SD permutations of the dot product constants in question. The final SOP terms are listed in increasing order of the number of adders required by the underlying sub-expressions. The DPL algorithm executes the following steps for each SD permutation.
Step 1: Load the next SD permutation of the dot product constants. This corresponds to a 2D slice of $b_{ijk}$, e.g. as in (4).
Step 2: Using the P1D strategy, slice rows with ≤ 1 non-zeros need not be considered for sub-expression sharing and are eliminated.
Step 3: It may be the case that a certain pattern is on one row and its 1's complement occurs on another row. Such pairs infer the same unique set of additions, albeit the final output of one is the negative of the other. Thus one of these rows can be eliminated since only additional inverters are required and the '1' added to the LSB can be subsumed by the PPST.
Step 4: Steps 2 and 3 reduce the 2D slice in (4) to the left matrix in (5). The DPL algorithm considers each row in turn and builds an implementation SOP for that row, as in (5) (overbars marking negative digits are not preserved in the source).
$$\begin{array}{cccc|l}
0 & 0 & 1 & 1 & (p_{11})\ \text{AND} \\
1 & 1 & 0 & 0 & (p_{6})\ \text{AND} \\
0 & 1 & 1 & 1 & ((p_{3})(p_{51})\ \text{OR}\ (p_{10})(p_{52})\ \text{OR}\ (p_{11})(p_{53}))\ \text{AND} \\
0 & 1 & 0 & 1 & (p_{10})\ \text{AND} \\
1 & 1 & 0 & 0 & (p_{0})
\end{array} \tag{5}$$
Each SOP term is represented internally as a data structure with elements p_vec (a bit vector where each set bit represents a specific adder to be resource allocated) and hw (the Hamming weight of p_vec that records the total adder requirement). The number of possible two input additions is equivalent to the combinatorial problem of leaf-labelled complete rooted binary trees [7]. With r = 4, the number of possibilities is 180 (proof omitted to save space) and the general series in r increases quickly for r > 4. We are currently researching an automated method for configuring the DPL algorithm for any r. So each p_vec is a 180-bit vector with a hw equal to the number of required adders. The SOPs for each row are logically ANDed together, as in (5), to form a permutation SOP (multiplying out the product ORs the p_vec bit vectors of the terms being combined). The permutation SOP is an exhaustive set of sub-expression options that implement the entire permutation. The permutation SOP for (4) is given by (6) where $p_v$ means bit v is set in the 180-bit p_vec for that SOP term.
$$((p_{11})(p_{6})(p_{3})(p_{51})(p_{10})(p_{0}))\ \text{OR}\ ((p_{11})(p_{6})(p_{10})(p_{52})(p_{0}))\ \text{OR}\ ((p_{11})(p_{6})(p_{53})(p_{10})(p_{0})) \tag{6}$$
The first term in (6) has hw = 6 so it requires 6 unique additions (+PPST) to implement (4) whereas the latter two options only require 5 unique additions (+PPST). Obviously one of the latter two options is more efficient if implementing this dot product in isolation. However when targeting a CMM problem one must consider the CMM level, and it may be that permuting the first option at CMML gives a better overall result since it may overlap better with requirements for the other dot products. Hence it is necessary to store the entire SOP for each permutation at DPL and then permute these at CMML to get the guaranteed optimal.
Step 5: The algorithm checks each term in the current SOP produced by step 4 to see if it has already been found with a previous permutation. If so it is discarded - only unique implementations are added to the global list. This global list is implemented using a 2D skip list to minimise the overhead of searching it with a new term from the current permutation SOP (Fig. 5). In the horizontal direction there are "skip nodes" ordered from left to right in order of increasing hw in the skip node list (SNL). In the vertical direction there are "product nodes" and each skip node points to a product node list (PNL) of ordered product nodes where each product node in the PNL has the same number of bits set (i.e. hw) in its p_vec bit vector. Therefore if a new node is presented with hw = hw_new and p_vec = p_vec_new it makes sense to only search the subset of nodes in the global list that have the same value hw_new (i.e. only search one particular PNL). The PNLs are ordered in increasing order of their p_vec, if p_vec is considered as a 180-bit integer. Therefore when iterating over a PNL at a given skip node, if a node is found that has a p_vec (p_vec_curr) such that p_vec_curr > p_vec_new, it is guaranteed that p_vec_new is not already in the list and it can be inserted at this point. When inserting into the list a unique permutation ID (pid) is added to the node along with p_vec so that the SD permutation that generated it can be reconstructed. If the condition p_vec_curr = p_vec_new is true, then the SOP term already exists in the PNL so the new node is discarded and the search terminates immediately for this SOP term.
Fig. 5. DPL Skiplist Arrangement. (A horizontal "Skip Node List" (SNL) of skip nodes in order of increasing hw, each with fields hw, next_skip and top_p; each skip node heads a vertical list of product nodes with fields p_vec, pid and next_p, terminated by NULL.)
The DPL algorithm is dominated by low level operations such as comparisons, Boolean logic and bit counting. Indeed profiling shows that on average 60% of the computation time is consumed by bit counting (50%) and bitwise OR (10%). Such tasks can readily be accelerated in hardware by mapping SOPs to a FIFO structure and the logic OR operations to OR gates.

B. Constant Matrix Multiplication Level (CMML) Stage
Once the DPL algorithm has run for each of the dot products in the CMM, there will be N 2D skip lists - one for each of
the N dot products examined. The task now is to find the best overlapping product nodes for all of the CMM dot products. It is expected (though not guaranteed) that since the skiplists are ordered with the lowest hw PNL first, the optimal result will be converged upon quickly, saving needless searching of large areas of the permutation space. The CMM Level (CMML) algorithm searches for the optimal overlapping nodes from each of the DPL lists.
The CMML algorithm permutes the terms in each skiplist with terms from others, starting from the top of each. For each permutation the N product nodes are combined using bitwise OR and bit counting similar to the techniques used in the DPL algorithm. The value of hw of the combined node represents the number of adders necessary to implement the CMM for the current permutation. The potential exists to use the lowest hw value found thus far to rule out areas of the search space. For example if an improved value of hw = 5 is found for a CMML solution, there is no point in searching DPL PNLs with hw > 5 since they are guaranteed not to overlap with other DPL PNLs and give a better result than 5. The current best value of hw at CMML level could also be fed back to the DPL algorithm to reduce the size of the skiplists generated by DPL (and hence the permutation space) without compromising optimality.
Although the ordering of the search space makes it more likely for the exhaustive CMML algorithm to find the best solution relatively quickly, the huge permutation space means that the exhaustive CMML approach is not tractable. However, the proposed modelling of the CMM problem and bit vector representation of candidate solutions means that the CMML algorithm is very amenable to GAs. The bit vectors can be interpreted as chromosomes and the value of hw can be used to build an empirical fitness function (the fewer adders required the fitter the candidate). Details of a suitable GA can be found in [8]. The current fitness function is based upon the number of adders but may in future be extended to include parameters such as layout and speed.

V. EXPERIMENTAL RESULTS
For a fair comparison with other approaches, the number of 1-bit full adders (FAs) allocated in each optimised architecture should be used as opposed to "adder units", since the bitwidth for each unit is unspecified in other publications apart from [4]. FA count more accurately represents circuit area requirements. Using the 8-point 1D DCT (N = 8 with various M) as a benchmarking CMM problem, Table I compares results with other approaches based on adder units and FAs where possible. Our approach compares favourably with [4] in terms of FAs (see FA% savings in Table I), even though this gain is not reflected by the number of adder units required.

TABLE I
1D 8-POINT DCT ADDER UNIT / FULL ADDER REQUIREMENTS

CMM         Initial   [1]   [5]   [4]          Proposed Approach
            +         +     +     +     FA     +     FA     FA%
DCT 8bit    300       94    65    56    739    78    730    1.2
DCT 12bit   368       100   76    70    1202   109   1056   12.1
DCT 16bit   521       129   94    89    2009   150   1482   26.2

The modelling approach used means that the proposed DPL algorithm is tractable and can search the DPL SD permutation space exhaustively in the order of hours (although this increases with M). However, the CMML search space is huge and the results presented here are based on searching <1% of the possibilities (using an untuned GA). However, it is expected that the ordering of the DPL solutions means that the most promising regions of the CMML search space are examined first. The hypothesis of achieving extra saving by permuting the SD representations is validated by the fact that the best SD permutation yielding the results in Table I is not the CSD permutation.
Although the savings achieved are incremental, there exists significant potential for improvement:
i: Investigation into the optimal value of r - that is the optimal sub-division of large CMM problems into independent chunks. This can only be truly evaluated if synthesis parameters such as fanout and routability are included in the optimisation criterion as well as FA count.
ii: The integration of the P2D strategy mentioned earlier. It is likely that there is a maximum distance, in rows of the $b_{ijk}$ slice, between the addends of diagonal patterns that form useful sub-expressions. This is because if sub-expression addends come from rows far apart in $b_{ijk}$, the adders inferred have a large bitwidth.
iii: Optimal tuning of the CMML GA parameters to search the permutation space most effectively.

VI. CONCLUSIONS
The general multiplierless CMM design problem has a huge search space especially if different SD representations of the matrix constants are considered. The proposed algorithm addresses this by organising the search space effectively and modelling the data in a fashion amenable to GAs and hardware acceleration. Preliminary results validate the approach and ...

REFERENCES
[1] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination," IEEE Trans. Computer-Aided Design, vol. 15, no. 2, pp. 151-165, Feb. 1996.
[2] A. G. Dempster and M. D. Macleod, "Digital Filter Design Using Subexpression Elimination and all Signed-Digit Representations," in Proc. IEEE International Symposium on Circuits and Systems, vol. 3, May 2004, pp. 169-172.
[3] M. D. Macleod and A. G. Dempster, "Common subexpression elimination algorithm for low-cost multiplierless implementation of matrix multipliers," IEE Electronics Letters, vol. 40, no. 11, pp. 651-652, 2004.
[4] N. Boullis and A. Tisserand, "Some Optimizations of Hardware Multiplication by Constant Matrices," IEEE Trans. Comput., vol. 54, no. 10, pp. 1271-1282, Oct. 2005.
[5] M. D. Macleod and A. G. Dempster, "Multiplierless FIR Filter Design Algorithms," IEEE Signal Processing Lett., vol. 12, no. 3, pp. 186-189, Mar. 2005.
[6] M. Martinez-Peiro, E. I. Boemo, and L. Wanhammar, "Design of High-Speed Multiplierless Filters Using a Nonrecursive Signed Common Subexpression Algorithm," IEEE Trans. Circuits Syst. II, vol. 49, no. 3, pp. 196-203, Mar. 2002.
[7] S. D. Andres, "On the number of bracket structures of n-operand operations constructed by binary operations," 2005, private communication.
[8] A. Kinane, V. Muresan, and N. O'Connor, "Optimisation of Constant Matrix Multiplication Operation Hardware Using a Genetic Algorithm," in Proc. 3rd European Workshop on Evolutionary Computation in Hardware Optimisation (EvoHOT), Budapest, Hungary, Apr. 10-12, 2006.
