Systolic Architecture for Implementation of 2-D Discrete Sine Transform  by Murty, M.N. et al.
 Procedia Engineering  30 ( 2012 )  441 – 448 
1877-7058 © 2011 Published by Elsevier Ltd.
doi: 10.1016/j.proeng.2012.01.883 
Available online at www.sciencedirect.com
 
International Conference on Communication Technology and System Design 2011
Systolic Architecture for Implementation of 2-D Discrete Sine 
Transform  
M.N. Murtya, S.S. Nayakb, B. Padhyc, S.N. Pandad, a* 
 
aDepartment of Physics, NIST, Berhampur-761008, Orissa, India 
bDepartment of Physics, JITM, Centurion University, Paralakhemundi-761200, Orissa, India 
cDepartment of Physics, Khallikote (Auto) College, Berhampur-760001, Orissa, India 
dDepartment of Physics, Gunupur College, Gunupur-765022, Orissa, India 
Abstract 
 
 In this paper, a recursive algorithm and two linear systolic architectures for realizing the one-dimensional discrete 
sine transform (DST) are presented. By using some mathematical techniques, any general length DST can be converted into 
a recursive equation. The recursive algorithms apply to arbitrary length algorithms and are appropriate for VLSI 
implementation. These two linear arrays have been utilised for designing a bilayer structure for computing the 2-D DST. 
This bilayer structure does not require any hardware / time for the transposition of the intermediate results. The desired 
transposition is achieved by orthogonal alignment of the linear array of the upper layer with respect to those of the lower 
layer. 
 
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of ICCTSD 2011 
 
Keywords: Discrete sine transform; Discrete cosine transform; Recursive; Systolic architecture 
1. Introduction 
The Discrete sine transform (DST) was first introduced to the signal processing by Jain[1], and several 
versions of this original DST were later developed by Kekre et al.[2], Jain[3] and Wang et al.[4]. There exist 
four even DST’s and four odd DST’s, indicating whether they are an even or an odd transform[5]. Ever since the 
introduction of the first version of the DST, the different DST’s have found wide applications in several areas in 
Digital signal processing (DSP), such as image processing[1,6,7], adaptive digital filtering[8] and 
interpolation[9]. The performance of DST can be compared to that of the discrete cosine transform (DCT) and it 
may therefore be considered as a viable alternative to the DCT. Yip and Rao[10] have proven that for large 
sequence length (N  32) and low correlation coefficient (U < 0.6), the DST performs even better than the DCT. 
 
In this paper, an algorithm to convert DST into a recursive form and two systolic architectures for 
parallel computation of DST are presented. Then these two linear arrays have been utilised for designing a 
bilayer structure for computing the 2-D DST. The proposed bilayer structure provides high throughput of 
computation due to fully pipelined processing and massive parallelism employed in this structure. 
 
              The rest  of the paper is organised as follows.  The  recursive algorithm for DST is presented in 
Section-2. Two systolic architectures for computation of DST are presented in Section-3. The comparison of 
linear array-I and -II with other research works is presented in Section-4. The bilayer systolic architecture for 
implementation of 2-D DST is presented in Section-5. Finally we conclude our paper in Section-6.
2.Proposed recursive algorithm for DST 
 Let X(n), 1  n  N, be the input data array. Then the type-II DST is defined as  
                                                 
 
* M.N.Murty, Cell : 7205171284,  
E-mail: mnarayanamurty@rediffmail.com 
Open access under CC BY-NC-ND license.
Open access under CC BY-NC-ND license.
442   M.N. Murty et al. /  Procedia Engineering  30 ( 2012 )  441 – 448 
 X(k) = 
 
»¼
º
«¬
ª ¦
 N
kn
nxC
N
n
k 2
12sin)(
1
S
      (1) 
 
for k = 1,2,…, N 
 
Where, 
 
¯
®
­
 
 
 
1-N 1,2,..., k for  1
Nkfor 
C 2
1
k  
Without loss of generality, all these scale factors may be ignored in the rest of the paper. Therefore,  
     12
1

 
¦ nk
N
n
SnxkX   for (2n-1)  1                   (2) 
and 
    »¼
º
«¬
ª  
N
kn
S nk 2
12sin12 S                     (3) 
From equation (3), we have 
 
 
»¼
º
«¬
ª
»¼
º
«¬
ª   
N
k
N
kn
SS nk
n
k 2
sin
2
22cos23212 SS                   (4) 
 
 
»¼
º
«¬
ª
»¼
º
«¬
ª   
N
k
N
kn
SS nk
n
k 2
sin
2
42cos25232 SS                   (5) 
Subtracting (5) from (4), we have 
»¼
º
«¬
ª »
¼
º
«
¬
ª
¸
¹
·¨
©
§   k)n(
N
Sin
N
k
sinSSS nk
n
k
n
k 3222
42
2
523212 SS   
=   32214  nkk SS   
Hence,   52322112 22  »¼
º
«¬
ª  nk
n
kk
n
k SSSS                  (6) 
From (2) and (6) the complete recurrence formula may be obtained as 
 Zk(0) = 0 
     111 kk SxZ   
      112   nZSnxnZ knkk                   (7) 
    kXNZ k           
and 
 523212   nk
n
kk
n
k SSAS                    (8) 
where 
  > @2122 kk SA           
443 M.N. Murty et al. /  Procedia Engineering  30 ( 2012 )  441 – 448 
3. Proposed linear systolic arrays for implementing the DST 
3.1  Linear array-I 
 The I-D DST is released in the linear systolic array shown in Fig. 1. The recursive relations given by 
(7) and (8) are implemented in this linear array. It consists of N locally connected identical processing elements 
(PEs). Function of each PE is described in Fig. 2. The sine kernels for the first PE are generated by a kernel 
generating processor, while each PE computes the sine kernel values for the next PE. One element of the input 
data is fed and stored in each PE in sequence which computes the partial transform result. These two processes 
are done in parallel in each PE. The kth PE computes the transform component X(k). The transformed output is 
obtained from the Nth PE. The first output is obtained after N time-steps and the rest (N-1) output are obtained 
in subsequent (N-1) time-steps. However, successive sets of N-point DST are obtained in every N time-steps. 
 
Each PE comprises of two multipliers and two adders. Also each PE has one accumulator and requires 
four latches for transfer of data as well as kernel values. This approach requires (2N-1) multiplications per point 
and (2N-3) additions per point for realisation of N length DST.  In each PE two multiply - add 
operations are performed in parallel. Since these two processes are done in parallel, only one addition and one 
multiplication time are consumed. The duration of the cycle period is T = TM + TA, where TM and TA are, 
respectively, the times involved in performing one multiplication and one addition in the PE. 
3.2  Linear array-II 
 
 The structure of the proposed linear array-II is shown in Fig. 3. It consists of N locally connected 
identical PEs. This array may also be used for computing the 1-D DST according to the recurrence relations 
given by (7) and (8). Function of each PE is described in Fig. 4. The sine kernels required by recurrence relation 
(8) are generated in the PEs. The corresponding initial values of sine kernels are stored in different PEs which 
generates the successive values in successive time-steps. The kth PE computes the transform component X(k). 
The first output is obtained after N time-steps and the rest (N-1) output are obtained in subsequent (N-1) time-
steps. Successive sets of N-point DST are obtained in every N time-steps. 
 
 In each PE two multiply - add operations are performed in parallel. Therefore, each PE comprises of 
two multipliers and two adders. Also each PE requires one accumulator and five additional storage for 
generation of kernels. In addition, each PE requires one latch for transfer of data.  
 
Fig.1. Linear array – I for computing N – point 1-D DST. 
 
 
Fig. 2.  Function of each PE of linear array-I 
444   M.N. Murty et al. /  Procedia Engineering  30 ( 2012 )  441 – 448 
Computed Output 
Fig. 3.  Linear array – II for computing N – point 1-D DST 
If (Count = N) then 
Count = 0 
Y out  m A 
A m 0 
Sc m 0 
S m 1kS  
 
 
If (1 d Count d N), then  
A m A + X in S 
T m S 
S m Ak S-Sc 
Sc m T 
X out m X in 
Count m Count + 1 
end if. 
Fig. 4.  Function of each PE of linear array - II
4. Comparisons of linear array-I and -II with related works 
The proposed approach requires (2N-1) multiplications per point, and (2N-3) additions per point for the 
realisation of N length 1-D DST. 
 
 In Tables 1 and 2, the number of multipliers and the number of adders in the proposed linear array are 
compared with the corresponding parameters based on the other methods. 
 
 Table 3 gives the comparison of the computation complexities of the proposed algorithm with other 
algorithms found in the related research works.  
 
 The hardware - and time-complexities of the proposed systolic realisation along with those of the 
existing structures [27] - [31] are listed in Table 4. 
 
 
 
 
 
 
 
 
 
 
 
445 M.N. Murty et al. /  Procedia Engineering  30 ( 2012 )  441 – 448 
Table 1. Comparison of the number of multipliers required by different algorithms  
N [11] [13] [17] [19,20,23] [21] [12] [26] [22] Proposed
4 6 5 5 4 11 2 5 4 7 
8 16 13 13 12 19 8 13 8 15 
16 44 35 33 32 36 30 29 16 31 
32 116 91 81 80 68 54 61 32 63 
64 292 227 193 192 132 130 125 64 127 
 
Table 2. Comparison of the number of adders required by different algorithms  
N [17] [13] [19,20,23] [11] [12] [21] [26] [22] Proposed 
4 9 9 9 8 4 11 14 7 5 
8 35 29 29 26 22 26 26 15 13 
16 95 83 81 74 62 58 50 31 29 
32 251 219 209 194 166 122 98 63 61 
64 615 547 513 482 422 250 194 127 125 
 
Table 3. Computation complexities 
 of multiplications of additions 
Proposed  2N-1 2N-3 
[13] (3/4) N log2N - N + 3 (7/4) N log2N - 2N + 3 
[14,16,20,23]  (1/2) N log2N (3/2) N log2N - N + 1 
[15,24,25] N log2N /2 + 1 3 N log2 N / 2 -N +1 
[18] (1/2) N log2N + (1/4) N-1 (3/2) N log2N + (1/2) N-2 
[21] 2(N+3)(N-1) / N 2(2N-1)(N-1) / N 
[22] (N+1)(N-1) / N (2N+1)(N-1) / N 
[26] if N is even 2N-3 3N+2 
[26] if N is odd 2(N-1) 3N+4 
 
Table 4. Hardware - and time-complexities of proposed structure and the existing systolic structures for the DST / DCT 
Structures Multipliers Adders Cycle-Time (T) Average Computation - Time 
Pan and Park [27] N 2N TM + TA NT/2 
Fang and Wu [28] N/2 + 3 N + 3 TM + 2TA NT 
Chiper et al. [29] N-1 N+1 TM + TA (N-1) T/2 
Meher [30] N/2 - 1 N/2 + 9 2(TM + TA) (N/4-1) T 
Meher [31] N/2 + 3 N/2 +5 TM + TA (N/2-1) T 
Proposed 2N-1 2N - 3 TM + TA NT 
 
5. Proposed bilayer architecture for implementation of 2-D DST 
The 2-D DST of a N1 u N2 array [x(n1, n2)] may be defined as 
446   M.N. Murty et al. /  Procedia Engineering  30 ( 2012 )  441 – 448 
        »
¼
º
«
¬
ª S
»
¼
º
«
¬
ª S ¦¦
  2
22
1
11
21
11
21 2
12sin
2
12sin
1
1
2
2
N
kn
N
kn
n,nxk,kX
N
n
N
n
               (9) 
for k1 = 1,2,…, N1 and k2 = 1,2,…, N2 
Equation (9) may be expressed as 
      »
¼
º
«
¬
ª 
 ¦
 2
22
21
1
21 2
12
sin,,
2
2
N
kn
nkWkkX
N
n
S
             (10) 
where 
      »
¼
º
«
¬
ª 
 ¦
 1
11
21
1
21 2
12
sin,,
1
1
N
kn
nnxnkW
N
n
S
                           (11) 
Equations (10) and (11) imply that [X(k1, k2)] is computed in two distinct stages. In the first stage, an 
intermediate result [W(k1, n2)] is computed from each column of the input matrix [x(n1, n2)] of size N1 u N2. In 
the second stage, [X(k1, k2)] is computed from each row of the intermediate result [W(k1, n2)]. 
The proposed systolic architecture for the implementation of 2-D DST, shown in Fig. 5, consists of two 
layers of systolic arrays placed one over the other.  The lower  layer consists of N2 number of linear array-II 
(Fig. 3) and the upper layer consists of N1 number of linear array-I (Fig. 1). Each linear array-II consists of N1 
number of locally connected identical PEs (Fig. 4) while each linear array-I consists of N2 number of locally 
connected identical PEs (Fig. 2). The lower layer makes the first stage of computation to provide the 
intermediate result [W(k1, n2)] to the upper layer. The upper layer makes the second stage of computation to 
yield the desired DST components. The linear arrays of the upper layer are placed orthogonally with respect to 
the linear arrays of the lower layer, such that each PE in the upper layer is placed over a PE of the lower layer. 
Each PE in the upper layer, therefore, receives its desired input in proper sequence of time. 
Table 5 gives the comparison of area-complexity, computation time and VLSI performance measure of 
the proposed bilayer structure with the structures of [32] and [33] 
 
Table 5 
Structure Area complexity(A) Computation time (W) VLSI performance measure (AW2) 
Proposed structure N2 2N 4N4
Structure of Lee and Huang [32] 3N2 2N 12N4
Structure of Gou et al. [33] (N4-1)/2 (N2+1)/2 (N8+2N6-2N2-1)/8
447 M.N. Murty et al. /  Procedia Engineering  30 ( 2012 )  441 – 448 
Fig. 5. Bilayer systolic architecture for implementation of 2 – D DST. 
6. Conclusion 
 In this paper, we proposed a recursive algorithm and two linear systolic arrays for computing the 1-D 
DST. These linear arrays involve less number of multipliers compared with some of the existing structures. Also 
the numbers of adders in these arrays are significantly less compared with many existing structures. These linear 
arrays have been utilized for designing the bilayer systolic architecture for implementation of 2-D DST. This 
bilayer structure does not require any hardware / time for transposition of intermediate results. It provides high 
throughput of computation due to fully pipelined processing and massive parallelism. It is observed that the 
proposed bilayer structure involves significantly less area-complexity, less computation time and provides better 
VLSI performance measure compared with other two structures [32] and [33]. 
References 
 [1] Jain AK. A fast Karhunen-Loeve transform for a class of random processes. IEEE Trans. Commun., COM-24, September               
                     1976; 1023-1029. 
 [2] Kekre HB, Solanka JK. Comparative performance of various trigonometric unitary transforms for transform image coding.  
                      Int. J. Electron., 44, 1978; 305-315. 
 [3] Jain AK. Sinusoidal family of unitary transforms. IEEE Trans. Patt. Anal. Machine Intell., PAMI-I, September 1979; 356- 
                       365. 
 [4] Wang Z, Hunt B. The discrete W transform. Applied Math Computat., 16, January 1985; 19-48. 
 [5] Poornachandra S, Ravichandran V, Kumarvel N. Mapping of discrete cosine transform (DCT) and discrete sine transform  
                     (DST) based on symmetries. IETE Journal of Research, 49(1), January-February 2003; 35-42. 
 [6] Cheng S. Application of the sine transform method in time of flight positron emission image reconstruction algorithms. IEEE  
                      Trans. BIOMED. Eng., BME-32, March 1985; 185-192. 
 [7] Rose K, Heiman A, Dinstein I. DCT/DST alternate transform image coding. Proc. GLOBECOM 87, 1, November 1987; 426- 
                     430. 
 [8] Wang JL, Ding ZQ. Discrete sine transform domain LMS adaptive filtering. Proc. Int. Conf. Acoust., Speech, Signal  
                     Processing, 1985; 260-263. 
 [9] Wang Z, Wang L. Interpolation using the fast discrete sine transform. Signal Processing, 26, 1992; 131-137. 
 [10] Yip P, Rao KR. On the computation and the effectiveness of discrete sine transform. Comput. Electron., 7, 1980; 45-55. 
 [11] Chen WH, Smith CH, Fralick SC. A fast computational algorithm for the discrete cosine transform. IEEE Trans.  
                        Communicat., COM-25(9), September 1977; 1004-1009. 
 [12] Yip P, Rao KR. A fast computational algorithm for the discrete sine transform. IEEE Trans. Commun., COM-28, February  
                        1980; 304-307. 
 [13] Wang Z. Fast algorithms for the discrete W transform and for the discrete Fourier transform. IEEE Trans. Acoust., Speech,  
                       Signal Processing, ASSP-32, August 1984; 803-816. 
 [14] Yip P, Rao KR. Fast decimation-in-time algorithms for a family of discrete sine and cosine transforms. Circuits, Syst.,     
                       Signal Processing, 3, 1984; 387-408. 
448   M.N. Murty et al. /  Procedia Engineering  30 ( 2012 )  441 – 448 
[15] Hou HS. A fast recursive algorithm for computing the discrete cosine transform. IEEE Trans. Acoust., Speech, Signal  
             Processing, ASSP-35(10), October 1987; 1455-1461. 
[16] Ersoy O, Hu NC. A unified approach to the fast computation of all discrete trigonometric transforms. In Proc. IEEE Int.  
            Conf. Acoust., Speech, Signal Processing, 1987; 1843-1846. 
[17] Malvar HS. Corrections to fast computation of the discrete cosine transform and the discrete hartley transform. IEEE Trans.  
            Acoust., Speech, Signal Processing, 36(4), April 1988; 610-612. 
[18] Yip P, Rao KR. The decimation-in-frequency algorithms for a family of discrete sine and cosine transforms. Circuits, Syst.,  
             Signal Processing, 7(1), 1988; 3-19. 
[19] Gupta A, Rao KR. A fast recursive algorithm for the discrete sine transform. IEEE Transactions on Acoustics, Speech and  
             Signal Processing, 38(3), March 1990; 553-557. 
[20] Cvetkoviü Z, Popoviü MV. New fast recursive algorithms for the computation of discrete cosine and sine transforms. IEEE
             Trans. Signal Processing, 40(8), August 1992; 2083-2086. 
[21] Caranis J. A VLSI architecture for the real time computation of discrete trigonometric transform. J. VLSI Signal Process., 5,  
             1993; 95-104. 
[22] Chau LP, Siu WC. Recursive algorithm for the discrete cosine transform with general lengths. Electronics Letters, 30 (3),  
             February 1994; 197-198. 
[23]Lee P, Huang FY. Restructured recursive DCT and DST algorithms. IEEE Transactions on Signal Processing, 42(7), July  
            1994; 1600-1609. 
[24] Britanak V. On the discrete cosine computation. Signal Process., 40(2-3), 1994; 183-194. 
[25] Kok CW. Fast algorithm for computing discrete cosine transform. IEEE Trans. Signal Process., 45, March 1997; 757-760. 
[26] Kober V. Fast recursive algorithm for sliding discrete sine transform. Electronics Letters, 38(25), December 2002; 1747- 
             1748.  
[27] Pan SB, Park RH. Unified systolic array for computation of DCT / DST / DHT. IEEE Trans. Circuits Syst. Video Technol.,  
            7(2), April 1997; 413-419. 
[28]Fang WH, Wu ML. Unified fully-pipelined implementations of one- and two-dimensional real discrete trigonometric  
            transforms. IEICE Trans. Fund. Electron. Commun. Comput. Sci., E82-A (10), October 1999; 2219-2230. 
[29] Chiper DF, Swamy MNS, Ahmad MO, Stouraitis T. A systolic array architecture for the discrete sine transform. IEEE trans.  
            Signal Process., 50(9), September 2002; 2347 - 2354. 
[30] Meher PK. A new convolutional formulation of the DFT and efficient systolic implementation. In Proc. IEEE Int. Region 10  
             Conf. (TENCON’05), November 2005; 1462-1466. 
[31] Meher PK. Systolic designs for DCT using a low-complexity concurrent convolutional formulation. IEEE Trans. Circuits &  
            Systems for Video Technology, 16(9), September 2006; 1041-1050. 
[32] Lee PZ, Huang FY. An efficient prime-factor algorithm for the discrete cosine transform and its hardware implementations.  
             IEEE Trans. Signal Processing, 42(8), August 1994; 1996-2005. 
[33] Guo JI, Liu CM, Jen CW. A new array architecture for prime-length discrete cosine transform. IEEE Trans on signal  
               processing, 41(1), January 1993; 436-442. 
