Abstract: The efficient design of multiplierless implementations of constant matrix multipliers is challenged by the huge solution search spaces even for small scale problems. Previous approaches tend to use hill-climbing algorithms risking sub-optimal

I. INTRODUCTION

Applications involving the multiplication of variable data by constant values are prevalent throughout signal processing. Some common tasks that involve these operations are Finite Impulse Response filters (FIRs), the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT). Optimisation of this kind of constant multiplication will significantly impact the performance of such tasks and the global system that uses them. The examples listed are instances of a more generalised problem, that of a linear transform involving a constant matrix multiplication (CMM). The problem is summarised as follows: substitute all multiplications by constants with a minimum number of shifts and additions/subtractions (we refer to both as "additions") [1]. The optimisation criterion may be extended beyond adder count alone to include factors like routability, glitching etc., but is restricted to adder count in this paper.

The goal is to find the optimal sub-expressions across all N dot products in (3) that lead to the fewest adder resources needed. Three properties aid the classification of approaches: SD

representations of the a_ij should be considered since for a CMM problem Canonic Signed Digit (CSD) representation is not guaranteed to be optimal (as shown in Section V). The difficulty is that the solution space is very large, hence SD permutation has thus far only been applied to simpler problems [2]. Potkonjak et al. acknowledge the potential of SD permutation but choose a single SD representation for each a_ij using a greedy heuristic. Neither of the recent CMM-specific algorithms applies SD permutation [3], [4].

2) Pattern Search: The pattern search goal is to find the sub-expressions in the 3D bit matrix b_ijk resulting in the fewest adders. Usually b_ijk is divided into N 2D slices along the i plane (i.e. taking each CMM dot product in isolation). Patterns are searched for in the 2D slices independently before combining the results for 3D. An example 2D slice is shown in (4), a 4-point dot product with random 12-bit SD constants.

use the P2D strategy [3], [4].
However, these proposals select

shown that choosing such sub-expressions can result in a speed reduction and greater power consumption [6]. It is therefore sensible to divide each N-point dot product into N/r r-point chunks and optimise each sub dot product independently. The CMM problem hence becomes N/r independent sub-problems, each with N dot products of length r (Fig. 3). The optimal choice of r is problem dependent, but the proposed algorithm currently uses r = 4 for reasons outlined subsequently.

III. PROPOSED EFFICIENT MODELLING SOLUTION

The CMM problem is a difficult discrete combinatorial problem and currently requires a shift to a higher class of algorithms for more robust near-optimal solutions. This is because the current approaches are greedy hill-climbing algorithms and the associated results are very problem dependent [5]. The challenge is in the modelling of the problem to make it amenable to efficient computation. The algorithm proposed here models the problem in such a way as to make it amenable to so-called near-optimal algorithms (genetic algorithms (GAs), simulated annealing, tabu-search) and also hardware acceleration. The proposed approach incorporates SD permutation of the matrix constants and avoids hill-climbing

The proposed algorithm permutes the SD representations of

Previous approaches derive one implementation option (akin to a single-term SOP) whereas the proposed approach derives parallel implementations (a multi-term SOP). It is this multi-term SOP approach and its manipulation (Section IV) that make the algorithm suitable for GAs and hardware acceleration.

The proposed algorithm currently uses the P1D strategy, so it searches for horizontal sub-expression patterns of {±1} digits in a 2D slice. The proposed SOP modelling idea can be extended to cover the P2D strategy by simply extending the digit set from {±1} to {1, +2, +, +4,7 ±,.

IV. THE PROPOSED CMM OPTIMISATION ALGORITHM

The proposed approach is a three-stage algorithm as summarised in Fig. 4. Firstly, all SD representations of the M-bit fixed-point constants are evaluated using an M-bit radix-2 SD counter. Then, each dot product in the CMM is processed independently by the dot product level (DPL) algorithm. Finally, the DPL results are merged by the CMM level (CMML) algorithm. The three steps may execute in a pipelined manner with dynamic feedback between stages. This offers search space reduction potential as outlined subsequently.

Step 3: It may be the case that a certain pattern occurs on one row and its 1's complement occurs on another row. Such pairs infer the same unique set of additions, albeit the final output of one is the negative of the other. Thus one of these rows can be eliminated, since only additional inverters are required and the '1' added to the LSB can be subsumed by the PPST.

Step 4: Steps 2 and 3 reduce the 2D slice in (4) to the left matrix in (5). The DPL algorithm considers each row in turn and builds an implementation SOP for that row, as in (5). Each SOP term is represented internally as a data structure with elements p_vec (a bit vector where each set bit represents a specific adder to be resource allocated) and hw (the Hamming weight of p_vec, which records the total adder requirement). The number of possible two-input additions is equivalent to the combinatorial problem of leaf-labelled complete rooted binary trees [7]. With r = 4, the number of possibilities is 180 (proof omitted to save space) and the general series in r increases quickly for r > 4. We are currently researching an automated method for configuring the DPL algorithm for any r. So each p_vec is a 180-bit vector with a hw equal to the number of required adders. The SOPs for each row are logically ORed together to form a permutation SOP that is an exhaustive set of sub-expression options that implement the entire permutation. The permutation SOP for (4) is given by (6), where p_v means bit v is set in the 180-bit p_vec for that SOP term.

\[
\big((p_{11})(p_{6})(p_{3})(p_{51})(p_{10})(p_{0})\big)\ \mathrm{OR}\ \big((p_{11})(p_{6})(p_{10})(p_{52})(p_{0})\big)\ \mathrm{OR}\ \big((p_{11})(p_{6})(p_{53})(p_{10})(p_{0})\big) \tag{6}
\]

The first term in (6) has hw = 6, so it requires 6 unique additions (+PPST) to implement (4), whereas the latter two options only require 5 unique additions (+PPST). Obviously one of the latter two options is more efficient if implementing this dot product in isolation. However, when targeting a CMM problem one must consider the CMM level, and it may be that permuting the first option at CMML gives a better overall result since it may overlap better with requirements for the other dot products. Hence it is necessary to store the entire SOP for each permutation at DPL and then permute these at CMML to get the guaranteed optimal.

Step 5: The algorithm checks each term in the current SOP produced by step 4 to see if it has already been found with a previous permutation. If so it is discarded; only unique

of ordered product nodes where each product node in the PNL has the same number of bits set (i.e. hw) in its p_vec bit vector. Therefore if a new node is presented with hw = hw_new and p_vec = p_vec_new, it makes sense to only search the subset of nodes in the global list that have the same value hw_new (i.e. only search one particular PNL). The PNLs are ordered in increasing order of their p_vec, if p_vec is considered as a 180-bit integer. Therefore, when iterating over a PNL at a given skip node, if a node is found whose p_vec is greater than p_vec_new, it is guaranteed that p_vec_new is not already in the list and can be inserted at this point. When inserting into the list, a unique permutation ID (pid) is added to the node along with p_vec so that the SD permutation that generated it can be reconstructed. If instead a node is found whose p_vec equals p_vec_new, then the SOP term already exists in the PNL, so the new node is discarded and the search terminates immediately for this SOP term.

The DPL algorithm is dominated by low-level operations such as comparisons, Boolean logic and bit counting. Indeed, profiling shows that on average 60% of the computation time is consumed by bit counting (50%) and bitwise OR (10%). Such tasks can readily be accelerated in hardware by mapping SOPs to a FIFO structure and the logic OR operations to OR gates.

B. Constant Matrix Multiplication Level (CMML) Stage

Once the DPL algorithm has run for each of the dot products in the CMM, there will be N 2D skip lists, one for each of

The CMML algorithm permutes the terms in each skip list with terms from others, starting from the top of each. For each permutation the N product nodes are combined using bitwise OR and bit counting similar to the techniques used in the DPL algorithm. The value of hw of the combined node represents the number of adders necessary to implement the CMM for the current permutation. The potential exists to use the lowest hw value found thus far to rule out areas of the search space. For example, if an improved value of hw = 5 is found for a CMML solution, there is no point in searching DPL PNLs with hw > 5 since they are guaranteed not to overlap with other DPL PNLs and give a better result than 5. The current best value of hw at CMML level could also be fed back to the DPL algorithm to reduce the size of the skip lists generated by DPL (and hence the permutation space) without compromising optimality.

Although the ordering of the search space makes it more likely for the exhaustive CMML algorithm to find the best

[8]. The current fitness function is based upon the number of adders but may in future be extended to include parameters

examined first. The hypothesis of achieving extra savings by permuting the SD representations is validated by the fact that the best SD permutation yielding the results in Table I is not the CSD permutation.

Although the savings achieved are incremental, there exists significant potential for improvement:

i: Investigation into the optimal value of r, that is, the optimal subdivision of large CMM problems into independent chunks. This can only be truly evaluated if synthesis parameters such as fanout and routability are included in the optimisation criterion as well as FA count.

ii: The integration of the P2D strategy mentioned earlier. It is likely that there is a maximum number of rows apart in the b_ijk slice that diagonal patterns forming useful sub-expressions will be. This is because if sub-expression addends come from rows far apart in b_ijk, the adders inferred have a large bitwidth.

iii: Optimal tuning of the CMML GA parameters to search the permutation space most effectively.
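The duplicate check of Step 5 and the OR-and-count combination of the CMML stage can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a plain dictionary of sorted lists in place of the 2D skip list, Python integers in place of the 180-bit p_vec, and exhaustive enumeration at CMML. All function names are hypothetical. The terms of the first dot product are those of (6); the terms of the second dot product are invented purely for illustration.

```python
from itertools import product


def make_term(adders):
    """Build a p_vec (a Python int used as a bit vector) from adder indices."""
    p_vec = 0
    for v in adders:
        p_vec |= 1 << v
    return p_vec


def hamming_weight(p_vec):
    """hw: the number of set bits, i.e. the number of adders required."""
    return bin(p_vec).count("1")


def insert_unique(pnls, p_vec, pid):
    """Insert (p_vec, pid) into the PNL for its hw unless already present.

    pnls maps hw -> list of (p_vec, pid), kept in increasing p_vec order.
    Returns True if inserted, False if discarded as a duplicate.
    """
    pnl = pnls.setdefault(hamming_weight(p_vec), [])
    for pos, (current, _) in enumerate(pnl):
        if current == p_vec:      # term already found: discard, stop searching
            return False
        if current > p_vec:       # passed the slot: cannot be further on
            pnl.insert(pos, (p_vec, pid))
            return True
    pnl.append((p_vec, pid))
    return True


def cmml_search(dot_product_pnls):
    """Combine one term per dot product; return (best hw, pids of that choice)."""
    best_hw, best_pids = None, None
    # Visit each dot product's terms in increasing hw order.
    ordered = [[t for hw in sorted(p) for t in p[hw]] for p in dot_product_pnls]
    for combo in product(*ordered):
        # Skip combinations containing a term that alone exceeds the best hw.
        if best_hw is not None and any(
            hamming_weight(pv) > best_hw for pv, _ in combo
        ):
            continue
        combined = 0
        for pv, _ in combo:
            combined |= pv        # shared adders are counted only once
        hw = hamming_weight(combined)
        if best_hw is None or hw < best_hw:
            best_hw, best_pids = hw, tuple(pid for _, pid in combo)
    return best_hw, best_pids


# Dot product A: the three SOP terms of (6).
pnls_a = {}
insert_unique(pnls_a, make_term([11, 6, 3, 51, 10, 0]), pid=0)   # hw = 6
insert_unique(pnls_a, make_term([11, 6, 10, 52, 0]), pid=1)      # hw = 5
insert_unique(pnls_a, make_term([11, 6, 53, 10, 0]), pid=2)      # hw = 5

# Dot product B: hypothetical terms, invented for this example only.
pnls_b = {}
insert_unique(pnls_b, make_term([3, 51, 20]), pid=0)
insert_unique(pnls_b, make_term([30, 31, 20]), pid=1)

result = cmml_search([pnls_a, pnls_b])
print(result)
```

With these invented terms, the hw = 6 term of (6) shares two adders with a term of the second dot product, so it gives the lowest combined adder count even though it is not the cheapest option in isolation. This is the reason the text gives for keeping the entire SOP at DPL.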
