Introduction
In recent years, hypercomplex numbers called quaternions attract attention of many researchers of the fields of digital signal processing (DSP), control, computer graphics, telecommunications, and others. By using hypercomplex arithmetic, known algorithms can be improved or extended to 4 dimensions so as to find new applications (Alexiadis & Sergiadis, 2009; Chan et al., 2008; Denis et al., 2007; Ell & Sangwine, 2007; Karney, 2007; Marion et al., 2010; Parfieniuk & Petrovsky, 2010a; Seberry et al., 2008; Took & Mandic, 2010; Tsui et al., 2008; Zhou et al., 2007) . Current research is mainly focused on theoretical development of quaternion-based algorithms, but one can expect that, in the course of time, engineers and scientists will implement them in hardware, and thus will need building blocks, design insights, methodologies, and tools. In known algorithms that use hypercomplex arithmetic, the key operation is quaternion multiplication, whose efficiency and accuracy obviously determines the same properties of the whole computational scheme of a filter or transform. Even though the operation has been thoroughly investigated from a mathematical point of view (Howell & Lafon, 1975) , rather little is known about the practical aspects of implementing it in hardware as a dedicated digital circuit. To the best of our knowledge, only two research groups reported development of fixed-point quaternion multipliers. In (Delosme & Hsiao, 1990; Hsiao & Delosme, 1996; Hsiao et al., 2000; Parfieniuk & Petrovsky, 2010b; Petrovsky et al., 2001; Verenik et al., 2007) , they considered various approaches to computing constant-coefficient multiplication using only binary shifts and additions: CORDIC, lifting, and Distributed Arithmetic (DA), but there is no review of the developed computational schemes, which would allow them to be compared and would inspire further research. The present chapter has two aims.
The first one is to briefly review known facts and achievements related to quaternion multipliers and, by presenting a novel CORDIC-Inside-Lifting architecture, to show that there is much to do in this field. The second aim is to present our methodology and design results related to rapid prototyping of different multiplier schemes using a Xilinx Virtex FPGA device. The chapter contents should be useful to persons interested in implementing hypercomplex computations, as it should allow readers to select the architecture that is best suited to their needs and to prototype hardware faster than working from scratch. The rest of the chapter is organized as follows. In Section 2, the properties of quaternion multiplication are briefly reviewed with particular emphasis on its matrix notation and noncommutativity. Section 3 presents possible architectures for multiplying a hypercomplex variable by a constant coefficient using only binary shifts and additions: DA, CORDIC approach, lifting scheme, and a novel combination of the last two schemes. Their general principles are discussed and design trade-offs they offer are considered so as to form a basis for developing digital circuits. Section 4 begins by presenting our rapid prototyping methodology, development platform and tools so as to show how we use Matlab and VHDL-FPGA tools to develop multiplier units. Then, finally, design experiments are reported for different architectures of quaternion multipliers. Throughout this chapter, the following notation is used. Matrices and column vectors are denoted by upper-and lower-case bold faced characters, respectively. The superscript T denotes transposition. I 2 and J 2 denote the 2 × 2 identity and reversal matrices, respectively. Γ 2 = diag (1, −1). The set of natural numbers is denoted by N and is assumed to include zero.
Quaternion multiplication
Quaternions are four-dimensional hypercomplex numbers whose rectangular form comprises arealpartandthreeimaginaryparts:
the latter of which are related to three imaginary units, i, j,andk. Because these units depend on each other in the following way
quaternion multiplication is noncommutative, unless one of its operands is a real number or both are complex numbers. The noncommutativity of quaternion multiplication becomes evident when hypercomplex numbers are identified with vectors, and the operation is written in matrix notation: 
Namely, the left-operand multiplication matrix, M + (·), differs from the right-operand one,
It should be emphasized that our attention is focused on multiplications in which x is an input variable, and q is a constant coefficient. The latter determines the multiplication matrix and thus the linear transformation of input data that has to be implemented in hardware. It is also assumed that both operands are unit-norm quaternions, i.e. |q| = √ q 1 + q 2 + q 3 + q 4 = 1.
It is easy to show (Parfieniuk & Petrovsky, 2010b ) that the multiplication matrices are related in the following way
where q = q 1 − q 2 i − q 3 j − q 4 k denotes the conjugate quaternion, and
is the matrix representation of the hypercomplex conjugate operator, as q = D C q.T h u s a multiplier that computes one variant of quaternion multiplication can be used to compute the other variant. It is only necessary to change the signs of some components of the input, output, and coefficient. From a complexity point of view, sign changes can be neglected, and thus both variants can be considered equivalent.
Multiplierless circuits for multiplying quaternions

Motivation
The straightforward approach to compute quaternion products is to use (3) without exploiting the special structures of the multiplication matrices. It obviously requires doing 16 multiplications and 12 additions of real numbers, but its actual performance highly depends on whether and how vector processing is supported by the hardware and used in implementation. In particular, quaternion computations can be accelerated on PC processors with SIMD extensions (Leiterman, 2003) . On the other hand, for FPGA and ASIC, it is often not desirable, or even possible, to implement real multipliers, because they occupy too much chip area and consume too much energy, offering unsatisfactory throughput. The preferred approach is to replace constant-coefficient multipliers with simple shifts and additions, the former of which can be hardwired. There are several different approaches to implement quaternion multiplication using only binary shifts and additions. In the next three subsections, they are described from a conceptual point of view, whereas practical design results and insights are presented in Section 4.
Distributed arithmetic 3.2.1 General principle
Distributed Arithmetic (DA) is a technique that allows vector inner products (sums of products) to be computed without multiplications (White, 1989) . The fundamental assumption is that one vector comprises fixed coefficients, whereas the elements of the other one are variables. The corresponding bits of the variables are used as an address into a read-only memory (ROM) that contains precomputed partial results: all possible The matrix-by-vector product (3), which describes quaternion multiplication, corresponds to four inner-products, each of which can be realized using DA (Petrovsky et al., 2001; Verenik et al., 2007) . The inner products are related to different constant vectors, which correspond to rows of the multiplications matrix, but use the same variable vector. Assuming that variable data are fractions, |x k |≤1, and are represented using signed 2's-complement with a word length of B bits, the components of x can be expressed as:
where x k,n are consecutive bits: from the sign bit, x k,0 , to least significant bit (LSB), x k,B−1 . Such expansion allows the components of the result r = qx to be expressed in terms of bits of x. In particular,
where
By reordering additions and introducing
wecanobtainthefollowing:
The derivations for the remaining three inner products are similar so they can be omitted. Because every bit x k,n can be either 0 or 1,r 1 (σ(n), x 1,n , x 2,n , x 3,n , x 4,n ) , n = 0,...,B − 1c a n take only 2 4 = 16 possible magnitudes, so its values for σ(n)=1 can be precomputed and stored in a memory. Then, during computation of a quaternion product, x 1,n , x 2,n , x 3,n , and x 4,n can be used as an address into the table, from where partial results are read. After accumulating B appropriately shifted partial results, the final result is obtained.
The corresponding scheme of a DA-based quaternion multiplier is shown in Fig. 1 , where the "P/S" block is a parallel-to-serial converter. The sign flag σ(n) is represented by an additional binary control signal, which switches the accumulators between the addition and subtraction modes, so that the first partial result is subtracted whereas the remaining ones are added. A less common but possible approach is to use this signal as a supplementary address into a memory extended so as to contain values for σ(n)=−1 (negated copies). In FPGA devices, the ROM can be implemented easily and efficiently using look-up tables (LUTs), which are the basic building block, or using external memory banks. Thus different design trade-offs are possible so as to utilize hardware features optimally. It is noteworthy that the DA-based multiplier is composed of four subblocks of the same structure, which differ only in ROM contents. The special structure of the multiplication matrix is not exploited. Fig. 1 . General scheme of DA-based quaternion multiplier.
Reducing memory requirements
A disadvantage of DA is that LUTs can occupy a lot of memory. Even though the problem concerns mainly inner products of lengthy vectors, which is typical for FIR filters, it can be important in case of implementing algorithms that use a number of quaternion multiplications. The necessary memory can be halved by using the offset-binary coding (OBC) (White, 1989) . Its derivation begins by rewriting (5) as
where the overbars represent the bit complements. If we define
which can be −1or+1, then (10) can be rewritten as Using (12), (6) can be rewritten as
Then, reordering terms and substituting (8) give
Obviously, the DA principle can be realized by usingr 1 (
as precomputed partial results and using ρ(x 1,n ),...,ρ(x 4,n ) to address look-up-tables.
and thus only half of 16 possible values ofr 1 (
In exchange, LUT addressing must be quite tricky, accumulators need to be switched between the addition and subtraction modes, as in the scheme in Fig. 2 . Namely, the partial result that corresponds to the n-th bit of input data is selected by using x 2,n , x 3,n ,andx 4,n ,whereas x 1,n decides whether it is added to the accumulator or subtracted. It is also noteworthy that accumulators are initialized with data from outside LUTs: see the term on the right-hand side of (14). This is why the timing diagram in Fig. 2 shows that 17 cycles of the master clock (Clk) are necessary to multiply 16-bit values.
Increasing the speed of DA-based quaternion multiplier
The speed of the straightforward bit-serial implementation of DA may be insufficient for certain real-time applications. In such cases, faster circuits can be developed in which subwords of input data are processed in parallel using more memory and adders (White, 1989) . Namely, L bits can be processed in a single clock period using L LUTs and a L-input accumulator that combines their outputs. Thus, the whole input word is processed in B/L clock cycles, assuming that L is an integer divisor of B, its word length. The possibility of this becomes obvious after rewriting (14) in the following way: The summation over n in (14) is parallelized by breaking it into two sums: the first over m, from 0 to B/L − 1, and the second over l,f r o m0t oL − 1. The former sum is computed iteratively, whereas the latter sum is realized as a single step using replicated hardware. A circuit that processes in such a way L bits at a time is called the L-BAAT scheme. Figures 2 and 3 allow for comparing 1-and 2-BAAT schemes we have developed. In the latter case, it is necessary to use two adders instead of one, and to shift the accumulator contents by 2 bits instead of by 1 bit. Additionally, the word length of the second memory block is shorter by 1bit. (Meher et al., 2009 ). The algorithm consists in approximating a rotation matrix by a product of discrete elementary rotations, each of which is implemented using only binary shifts and additions. The rotation of a point (x 1 , x 2 ) by an angle φ can be described as the multiplication of the corresponding coordinate vector x by the appropriate rotation matrix R(φ)
From another point of view, this equation describes the complex multiplication of x 1 + jx 2 by e jφ . The CORDIC algorithm is based on factorizing the rotation matrix in the following way
where N, σ(n),a n dτ(n) are selected in such a way that the product approximates a scaled rotation by φ, and simultaneously the matrix on the right-hand side describes a transformation that can be implemented using two binary shifts and two additions. The transformation is called a CORDIC iteration, elementary rotation, or microrotation. It not only rotates the point but also increases its norm by 1 + 2 −2τ(n) . In order to counterbalance the latter effect, the output of a sequence of microrotations must be scaled by
which can also be done by using only binary shifts and additions, which form a series of scaling iterations where s(m) ∈ N\{0} are selected so as the product approximates S tot .
In the classical form of CORDIC,
and
This makes scaling independent of the angle, which is advantageous when angle determination is a part of a CORDIC-based algorithm. If the angle is known a-priori, as in case of a constant-coefficient complex multiplication, then by making τ(n) ∈ N independent of n, the same accuracy can be obtained using less microrotations. Figure 4 shows the schemes that realize (a) microrotations and (b) scaling iterations of the 2D-CORDIC. If chip area must be saved, a single switched CORDIC unit can be developed, which is able to realize different microrotations. At the price of occupying more chip area, throughput can be increased by unfolding computations, i.e. by pipelining many units, each of which computes a different microrotation. We are interested mainly in the unfolded architecture, which can be optimized by hardwiring fixed binary shifts. It is also noteworthy that (18) applies only to rotations (angles) for which cos(φ) ≥ 0a n d |cos(φ)|≥|sin(φ)|. Fortunately, the rotation by an unsuitable angle φ can always be realized by simply pre-and post-processing the rotation by a tightly connected suitable angleφ:
In particular,
so it is sufficient to swap values and/or to change their signs. It is also important for us that arbitrary S tot < 1 can be approximated using (20) .
235
Rapid Prototyping of Quaternion Multiplier: From Matrix Notation to FPGA-Based Circuits www.intechopen.com
4D CORDIC
In (Delosme & Hsiao, 1990) , it has been shown that the CORDIC algorithm can be extended to more dimensions and used to implement quaternion multiplications, which consists in the following factorization
Setting the parameters in the conventional way, in accordance with (21) - (23), makes scaling independent of q,as
Four-dimensional CORDIC iterations need to be implemented using four-argument adders, as shown in Fig. 5 . 
CORDIC-Inside-Lifting 3.4.1 Lifting-based quaternion multiplier
Due to word length limitations, results of binary shifts and additions must be truncated. This causes both DA-and CORDIC-based quaternion multipliers to realize data transformations that cannot be reversed in fixed-point arithmetics. Namely, the multiplication by the inverse of the coefficient, or simply by its conjugate in case of a unit-norm coefficient, gives only an approximation of the input value. This is unacceptable for applications like lossless image coding. The issue can be solved by using lifting schemes, which allow reversible integer-to-integer mappings to be realized using finite-precision arithmetic (Calderbank et al., 1998) . Similarly to CORDIC, the idea has been introduced in the context of 2D transformations, but we have extended it to quaternions in (Parfieniuk & Petrovsky, 2010b ). Our idea is based on the structural similarity between R(φ) in (18) and M + (q) in (3a), which becomes evident after rewriting the latter matrix in the following way
This allows for assuming the following factorization
which is inspired by the known three-shear factorization of the 2D rotation matrix . The corresponding scheme is shown in Fig. 6(a) . For a given hypercomplex coefficient q, (33) defines a set of matrix equations, which can be solved uniquely for F(q), G(q),a n dH(q),p r o v i d e dt h a tS(q) is non-singular, or more specifically, non-zero. Namely, the three matrices are given by
so that their elements represent real-valued lifting coefficients. Inverting the triangular matrices in (33) requires only changing the signs of their out-of-diagonal elements (lifting coefficients). Thus, the multiplication by 1/q, or equivalently by q, is realized by reversing the order of the lifting steps and negating their coefficients. Rounding of neither lifting coefficient nor the result of the related multiplication affects invertibility, even though it causes slight deviation of the stage output from that of the original. Fortunately, the issue is acceptable for most applications. In addition to making reversible integer-to-integer mappings possible in fixed-point arithmetic, an inherent property of lifting schemes is that all computations can be performed in-place, i.e. without using auxiliary memory. Another advantage of the proposed lifting-based quaternion multiplier is that 12 real multiplications and 12 real additions are necessary to compute the result. This is 14% less than required by the straightforward matrix-by-vector multiplication (3a).
Embedding CORDIC inside lifting
Alternatively, quaternions can be represented in polar form, using modulus |q| and three angles: φ, ψ,andχ, which can be converted into the rectangular components as follows:
where taking −π ≤ φ < π, −π/2 ≤ ψ ≤ π/2, and −π/2 ≤ χ ≤ π/2 is sufficient to describe all quaternions of a given norm. By assuming |q| = 1 and substituting (35) into (34), we can derive the following simple closed-form equations for lifting coefficient values (Parfieniuk & Petrovsky, 2010b) : where d = sin φ sin ψ. It turns out that there are pairs of lifting coefficients with the same absolute value, which is not obvious from (34) and was not assumed when we constructed the factorization (33). Moreover, F(q), G(q),andH(q) have a structure closely related to that of the rotation matrix
and thus can be modeled using 2D-CORDIC. Only simple postprocessing is necessary, as explained in Fig. 6(b) . In (Parfieniuk & Petrovsky, 2010b) , it is shown that other known lifting factorizations result in computational schemes that are less regular and less efficient. It is also proved that the dynamic range of the proposed scheme can always be limited so as all lifting coefficients have magnitudes not greater than 1. This is achieved by replacing the hypercomplex multiplication by q with the multiplication byq,av e r s i o no fq with exchanged and/or negated parts. The latter operation needs only to be appropriately pre-and postprocessed:
4. Rapid prototyping of quaternion multipliers using Xilinx FPGA 4.1 General methodology Fig. 7 . Development environment based on the evaluation board Xilinx ML401 (Virtex-4 XC4VSX35 FPGA).
The methodology we used for rapid prototyping of quaternion multipliers can be described as the following sequence of steps:
1. Investigating the general idea of a multiplier architecture using rough MATLAB scripts: validating a decomposition of the quaternion product and related matrix factorizations, and then evaluating or determining properties of the corresponding computational scheme.
2. Refactoring and extending MATLAB code so as to prepare reusable functions and scripts, which can be used in different contexts: deeper investigations, low-level coefficient optimization, simulation-based functional verification of hardware designs etc.
3. Repeating Steps 1 and 2 for promising variants of the essential architecture so as to select variants that deserve a deeper investigation.
4. For each variant, determining circuit parameters and synthesizing low-level coefficients, partial results etc. that allow a given constant-coefficient hypercomplex multiplication to be computed with the necessary accuracy.
5. Rejecting parameter-coefficient sets, and possibly variants, which are definitely no match for the others in terms of circuit complexity.
6. Developing detailed MATLAB models of the acceptable hardware multipliers.
7. Initial hardware design using MATLAB code as a reference: VHDL coding and logic synthesis using an FPGA development environment, employing the most advanced of available predefined subcircuits, and generously selecting word lengths.
8. Functional verification of the preliminary designs using FPGA tools and MATLAB scripts. Refining the latter if necessary.
9. Fine-tuning word lengths and subcircuits so as to minimize chip area and maximize throughput of the multipliers.
10. A final comparison of the developed circuits so as to determine that which is best suited to a given application context.
As our goal was to develop optimized circuits and to investigate their properties, and code generators are known to produce highly suboptimal implementations (Haldar et al., 2001) , VHDL code was hand-written. This slows down development but not much as quaternion multipliers are rather simple systems, compared to filters, transforms, etc. Additionally, manual transition from MATLAB to VHDL can be made easier by partitioning MATLAB code so as it reflects a circuit architecture. Rather modest resources are necessary to implement our methodology, as shown in Fig. 7 . Apart form MATLAB, the Leonardo Spectrum and Xilinx ISE software were used in our research. On the other hand, experiments were conducted on the Virtex-4 ML 401 Evaluation Platform, which is powered by the Xilinx XC4VLX25 FPGA device and equipped with a range of memories, industry-standard peripherals, interfaces, and connectors like USB or VGA. Thus the board allows for both studying quaternion multipliers and employing them in practical embedded systems like an image codec.
DA-based circuits
In DA-based multipliers, the output error is essentially determined by the word length and independent of implementation details. Thus we focus on how parallelization of the computations affects throughput, chip area and power consumption. In order to evaluate the available design opportunities, five 16-bit DA-based quaternion multipliers have been prototyped with the parallelism ratio from 1 to 16. The average synthesis results for the Xilinx Virtex v400-4 FPGA are listed in Table 1 . Figures 8 and 9 show the power consumption of DA-based quaternion multipliers as a function of the parallelism ratio and required sampling time, respectively. The analysis of the results shows that the 1-BAAT scheme of the multiplier occupies the smallest chip area, whereas the 8-BAAT scheme consumes the smallest power for a required sampling time. 
4D-CORDIC-based circuit
As an example, a 4D-CORDIC-based quaternion multiplier by q = 1 √ 30
(4 − i + 3j − 2k) has been developed using our methodology for rapid prototyping. In order to provide the desired accuracy, ǫ = 10 −4 , a word length of 16 bits is necessary, and the circuit needs to comprise N = 16 microrotations and M = 7 scaling iterations, which are realized as illustrated in Figs. 5 and 4(b), respectively. The parameters of the stages are listed in Table 2 . We have verified that it is advantageous to modify the microrotations in (28) so as to allow only one of σ k (n) to be nonzero. Without affecting the convergence, this simplifies the circuit as two-operand adders can be used in the scheme in Fig. 5 instead of four-operand ones. Additionally, the expression for the scaling factor changes from (29) to (19). The FPGA implementation occupies 201 Slice Flip-Flops, 554 4-Input LUTs, and 351 Slices, which is about 1% of the resources provied by the Virtex chip. The maximum obtainable clock frequency is 120 MHz. (a) Microrotations:
10 τ (15) 10 (4 − i + 3j − 2k); N = 16 and M = 7.
CORDIC-Inside-Lifting-based multiplier
The CORDIC-Inside-Lifting architecture is most complicated to design. Before prototyping a multiplier by q, all modifications of the hypercomplex number are reviewed that potentially determine computational schemes with all lifting coefficients in the range −1, . . . 1. For every such a scheme, the number of 2D-CORDIC iterations is estimated that guarantees the desired accuracy of computations, and this allows for selecting candidates for fine-tuning parameters and then for FPGA synthesis. (4 − i + 3j − 2k). For multiplications by only 7 of them, all lifting coefficients are in the desired range. And, the first of these 7 quaternions, which is shown the 2nd row of Table 3 ,q = q 3 + q 2 i + q 1 j + q 4 k, evidently can be realized using the lowest number of microrotations, N = N U + N L + N V = 8 at accuracy of an individual CORDIC unit not worse than ǫ = 10 −5 . In order to realize the multiplication by q, the multiplication byq must be preprocessed with Table 3 . Lifting coefficients that result from factorizing the multiplication matrices of modified versions of q = 1 √ 30
(4 − i + 3j − 2k).
1 In (Parfieniuk & Petrovsky, 2010b) , it has been shown that other modifications are redundant. Table 4 shows the parameters of the individual 2D-CORDIC modules of the scheme in Fig. 6(b) . They have been obtained by careful optimization aimed at achieving the total accuracy of a multiplier not worse than ǫ = 10 −4 . The results of FPGA design are worse than for the 4D-CORDIC approach, as about 50% more chip area is occupied. A single 2D-CORDIC circuit needs on average 50% less resources than a 4D-CORDIC unit, but three such units and extra adders are necessary to implement the scheme in Fig. 6(b) . However, only this approach offers the invertibility of multiplication. Table 4 . Parameters of CORDIC-Inside-Lifting-based quaternion multiplier.
Conclusions
There are several approaches to implementing constant-coefficient quaternion multipliers using only binary shifts and additions. Design experiments based on MATLAB, Virtex FPGA device, and our methodology for rapid prototyping, clearly show that each of the presented architectures has both advantages and disadvantages, and thus it is impossible to definitely recommend one of them as best suited for all development situations. From the point of view of rapid prototyping they require similar development effort and time. Distributed arithmetic offers a smooth trade-off between throughput and occupied chip area, whereas in CORDIC, we can only select between iterative or unfolded computations. However, the latter approach allows for maximizing throughput at modest resource utilization. Finally, the CORDIC-Inside-Lifting approach is the only option if lossless processing has to be realized, even though this architecture is neither economical in terms of chip area nor fast.
