In this paper, we present and discuss three efficient structural solutions for hardware-oriented implementation of the basic operations of the discrete quaternion Fourier transform with reduced implementation complexity. The first solution is a scheme for calculating the sq product, the second is a scheme for calculating the qt product, and the third is a scheme for calculating the sqt product, where s is a so-called i-quaternion, t is a j-quaternion, and q is a usual quaternion. Direct multiplication of two usual quaternions requires 16 real multiplications (or two-operand multipliers in a fully parallel hardware implementation) and 12 real additions (or binary adders). By contrast, our solutions allow the design of computation units that use only 6 multipliers and 6 two-input adders for the sq or qt basic operation, and 9 binary multipliers, 6 two-input adders and 4 four-input adders for the sqt basic operation.
Introduction
The two-dimensional discrete Fourier transform (2D-DFT) has been widely used in image processing ever since the discovery of the fast Fourier transform (FFT), which made computing the DFT feasible on a computer [1]. However, if we want to apply the classical 2D-FFT to color images, we must perform three separate 2D-FFTs. This is because every color image pixel has three values associated with it: the red, green, and blue components.
Until recently, no discrete transform had been proposed that treats each pixel of a color image as a whole. Sangwine and Ell defined a new transform, called the discrete quaternion Fourier transform (DQFT), which allows all color components of the image to be processed simultaneously [2] [3] [4]. The idea of the DQFT is based on representing color image pixels as quaternions, the four-dimensional hypercomplex numbers discovered by Hamilton in 1843. Today there are several ways to calculate the 2D-DQFT [5] [6] [7] [8] [9] [10] [11] [12] [13]. Another way of calculating the DQFT has been reported in [14]. While it produces the same result as the other approaches, it is more efficient because it reduces the computational complexity. Implementing the DQFT by the method proposed in [14] requires three types of quaternion multiplication: left-sided, right-sided and two-sided. Moreover, in all three cases only one operand is a usual quaternion; the remaining operands are constant quaternions, that is, quaternions whose coefficients are real constants. In what follows we propose hardware-efficient schemes to implement these operations.
Statement of the problem
The typical operations of the two-dimensional forward and inverse discrete quaternion Fourier transforms are [14]:
- left-sided quaternion multiplication sq,
- right-sided quaternion multiplication qt, and
- two-sided quaternion multiplication sqt,

where $q = q_0 + i q_1 + j q_2 + k q_3$ is a usual quaternion with three imaginary units $i$, $j$ and $k$, $s = \alpha + i\beta$ and $t = \gamma + j\delta$ are the so-called i-quaternion and j-quaternion respectively [14], and $\alpha$, $\beta$, $\gamma$, $\delta$ are real constants.

During synthesis of the discussed schemes we use the fact that multiplication of two quaternions may be represented as a vector-matrix product [15, 16]. The matrix that participates in this product has unique structural properties that allow an advantageous factorization [17]. It is this factorization that significantly reduces the computational complexity of quaternion multiplication. Furthermore, since s and t are truncated quaternions, in fact complex numbers located in different domains of the complex space, the corresponding matrices are sparse. This further reduces the computational complexity. Finally, an additional saving comes from the fact that $\alpha$, $\beta$, $\gamma$ and $\delta$ are real constants, so their products can be calculated in advance and stored in memory.
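The matrix-vector view and the sparsity mentioned above can be illustrated with a minimal Python sketch. It is not the factorization of [17]. It assumes the coefficient forms s = a + i b and t = c + j d (with a, b, c, d standing for the real constant coefficients) and the ordering (real, i, j, k) for quaternion coefficients; all function names are hypothetical.

```python
# Sketch only: coefficient ordering (real, i, j, k) and all names are assumptions.

def quat_mul(p, q):
    """Direct Hamilton product: 16 real multiplications, 12 additions."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    return (
        p0*q0 - p1*q1 - p2*q2 - p3*q3,
        p0*q1 + p1*q0 + p2*q3 - p3*q2,
        p0*q2 - p1*q3 + p2*q0 + p3*q1,
        p0*q3 + p1*q2 - p2*q1 + p3*q0,
    )

def left_matrix_i(a, b):
    """4x4 matrix of left multiplication by s = a + i*b (half the entries are zero)."""
    return [
        [a, -b, 0,  0],
        [b,  a, 0,  0],
        [0,  0, a, -b],
        [0,  0, b,  a],
    ]

def right_matrix_j(c, d):
    """4x4 matrix of right multiplication by t = c + j*d (half the entries are zero)."""
    return [
        [c, 0, -d,  0],
        [0, c,  0, -d],
        [d, 0,  c,  0],
        [0, d,  0,  c],
    ]

def mat_vec(m, v):
    """Plain matrix-vector product returning a tuple."""
    return tuple(sum(m_rc * v_c for m_rc, v_c in zip(row, v)) for row in m)

q = (1, 2, 3, 4)
s = (1, 2, 0, 0)   # i-quaternion: only real and i parts are nonzero
t = (3, 0, 4, 0)   # j-quaternion: only real and j parts are nonzero

# The sparse matrices reproduce the direct products.
print(mat_vec(left_matrix_i(1, 2), q) == quat_mul(s, q))    # True
print(mat_vec(right_matrix_j(3, 4), q) == quat_mul(q, t))   # True
```

Each matrix is block structured: left multiplication by s acts on the coefficient pairs (q0, q1) and (q2, q3), while right multiplication by t acts on the pairs (q0, q2) and (q1, q3).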
The schemes
Let the coefficients of the quaternion q be collected in a column vector, and let the elements of the sq product be collected in a second column vector.
The first scheme (implementing the sq kernel) can then be written as a matrix-vector calculating procedure on these vectors.
Fig. 2 shows a data flow diagram of the proposed scheme for implementing the qt product kernel.
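The multiplier counts of such kernels can be illustrated with a well-known device, the three-multiplication complex product usually attributed to Gauss. The Python sketch below is an illustration under stated assumptions and is not claimed to be the authors' factorization. It assumes s = a + i b and t = c + j d, the coefficient ordering (real, i, j, k), and constant sums and products that are computed once in advance; all function names are hypothetical.

```python
# Sketch only: names, coefficient ordering (real, i, j, k) and the use of the
# Gauss three-multiplication trick are assumptions, not the paper's scheme.

def const_cmul(a, b):
    """Return f(x, y) = (a*x - b*y, a*y + b*x).

    b - a and a + b depend only on the constants and are precomputed,
    so each call to f costs 3 multiplications and 3 additions.
    """
    b_minus_a, a_plus_b = b - a, a + b

    def f(x, y):
        k1 = a * (x + y)
        k2 = b_minus_a * x
        k3 = a_plus_b * y
        return k1 - k3, k1 + k2
    return f

def make_sq(a, b):
    """Kernel q -> s*q for s = a + i*b: 6 multiplications, 6 additions."""
    f = const_cmul(a, b)

    def sq(q):
        q0, q1, q2, q3 = q
        y0, y1 = f(q0, q1)   # real and i parts
        y2, y3 = f(q2, q3)   # j and k parts
        return y0, y1, y2, y3
    return sq

def make_qt(c, d):
    """Kernel q -> q*t for t = c + j*d: 6 multiplications, 6 additions."""
    f = const_cmul(c, d)

    def qt(q):
        q0, q1, q2, q3 = q
        y0, y2 = f(q0, q2)   # real and j parts
        y1, y3 = f(q1, q3)   # i and k parts
        return y0, y1, y2, y3
    return qt

def make_sqt(a, b, c, d):
    """Kernel q -> s*q*t using 9 multiplications by precomputed constants."""
    ds = (a, b - a, a + b)
    dt = (c, d - c, c + d)
    w = [[u * v for v in dt] for u in ds]   # 9 constant products, computed once

    def sqt(q):
        q0, q1, q2, q3 = q
        # Pre-additions: 5 two-input additions.
        r01, r23, r02, r13 = q0 + q1, q2 + q3, q0 + q2, q1 + q3
        m = [[r01 + r23, r01, r23],
             [r02, q0, q2],
             [r13, q1, q3]]
        # 9 multiplications by the stored constants.
        n = [[w[r][k] * m[r][k] for k in range(3)] for r in range(3)]
        # Post-additions: each output is a signed sum of four products.
        y0 = n[0][0] - n[2][0] - n[0][2] + n[2][2]   # real part
        y1 = n[0][0] + n[1][0] - n[0][2] - n[1][2]   # i part
        y2 = n[0][0] - n[2][0] + n[0][1] - n[2][1]   # j part
        y3 = n[0][0] + n[1][0] + n[0][1] + n[1][1]   # k part
        return y0, y1, y2, y3
    return sqt

q = (1, 2, 3, 4)
print(make_sq(1, 2)(q))          # (-3, 4, -5, 10)
print(make_qt(3, 4)(q))          # (-9, -10, 13, 20)
print(make_sqt(1, 2, 3, 4)(q))   # (11, -28, -27, 46)
```

In this sketch the sq and qt kernels each use 6 multiplications and 6 two-input additions, and the sqt kernel uses 9 multiplications, 5 two-input pre-additions and four four-term sums. The arrangement of adders in the authors' schemes may differ.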
Conclusion
The article presents three new hardware-efficient schemes for executing the sq-product, qt-product and sqt-product kernels with reduced computational complexity. To reduce the hardware complexity (the number of embedded adders and multipliers), we exploit the specific structural properties of the matrix-vector products that represent these basic operations. As a result, a fully parallel implementation of the sq-product or qt-product kernel requires only 6 multipliers by real numbers and 6 adders. In turn, a fully parallel implementation of the sqt-product kernel requires only 9 binary multipliers, 6 two-input adders and 4 four-input adders.
Reducing the number of multiplications is especially important in the design of specialized VLSI on-board DSP processors, because minimizing the number of multipliers also reduces the power dissipation and lowers the implementation cost of the entire system. This is because a hardware multiplier is a more complicated unit than an adder and occupies much more chip area. (It is known that the hardware complexity of an embedded multiplier grows quadratically with operand size, while the hardware complexity of a binary adder grows linearly with operand size.) Even if the VLSI chip already contains embedded multipliers, their number is always limited. This means that if the implemented scheme has a large number of multiplications, the designed processor may not fit into the chip, so the problem of minimizing the number of multipliers remains relevant.
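As a rough numerical illustration of this argument, the Python sketch below compares the unit counts quoted in this article under a deliberately simple cost model. The model is an assumption made here for illustration only: a w-bit multiplier is charged w^2 units, a w-bit two-input adder is charged w units, and a four-input adder is counted as three two-input adders. The function name is hypothetical and the resulting figures are not measurements.

```python
# Illustrative cost model only; the unit costs below are assumptions.

def relative_area(multipliers, adders2, adders4=0, width=16):
    """Area estimate in arbitrary units for operands of `width` bits."""
    mult_cost = width * width   # quadratic growth with operand size
    add_cost = width            # linear growth with operand size
    return multipliers * mult_cost + (adders2 + 3 * adders4) * add_cost

direct = relative_area(16, 12)         # direct product of two usual quaternions
sq_or_qt = relative_area(6, 6)         # sq or qt kernel
sqt = relative_area(9, 6, adders4=4)   # sqt kernel

print(direct, sq_or_qt, sqt)   # 4288 1632 2592
```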
This problem becomes especially challenging for applications that require real-time processing at high throughput, particularly in digital signal and image processing. Hence, to meet the throughput and power-consumption constraints of real-time image processing systems, developing hardware-efficient schemes that can be implemented on application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) is of paramount importance.
