I. INTRODUCTION
The image compression standard in current use is JPEG (Joint Photographic Experts Group), and the energy compaction technique used in this standard is the Discrete Cosine Transform (DCT).
The n rows of an n-point DCT matrix T are defined by:
1> For all i = 1 to n: $t_{1i} = \sqrt{1/n}$
2> For all i = 1 to n and k = 2 to n: $t_{ki} = \sqrt{2/n}\,\cos\left(\frac{\pi (2i-1)(k-1)}{2n}\right)$
The 8-point DCT matrix T (n = 8) is the case used in JPEG; hence we shall concentrate on it as our example, but bear in mind that DCTs may be defined for a wide range of sizes n.
The basic n-point DCT requires $n^2$ multiplications and $n(n-1)$ additions to calculate y = T x. For the 8x8 matrix T that amounts to 64 multiplications and 56 additions. From the DCT matrix, it is clear that symmetries exist in the DCT basis functions. These symmetries can be exploited to reduce the computational load of the DCT.
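The definition and the operation count can be checked with a short sketch. It assumes the standard orthonormal DCT-II convention, in which the cosine argument carries the factor (k-1); this choice reproduces the constants 0.3536 and 0.4904 used later in the paper. The function names `dct_matrix` and `direct_dct` are illustrative and do not come from the paper.

```python
import math

def dct_matrix(n):
    """Rows k = 1..n and columns i = 1..n of the orthonormal n-point DCT matrix."""
    rows = []
    for k in range(1, n + 1):
        # Row 1 uses sqrt(1/n); every other row uses sqrt(2/n).
        scale = math.sqrt(1.0 / n) if k == 1 else math.sqrt(2.0 / n)
        rows.append([
            scale * math.cos(math.pi * (2 * i - 1) * (k - 1) / (2 * n))
            for i in range(1, n + 1)
        ])
    return rows

def direct_dct(T, x):
    """Compute y = T x as a plain matrix-vector product and count operations."""
    n = len(x)
    mults = adds = 0
    y = []
    for row in T:
        acc = row[0] * x[0]
        mults += 1
        for j in range(1, n):
            acc += row[j] * x[j]
            mults += 1
            adds += 1
        y.append(acc)
    return y, mults, adds

T = dct_matrix(8)
y, mults, adds = direct_dct(T, [1.0] * 8)
print(mults, adds)  # 64 56
```

The second row of `T` is antisymmetric about its midpoint (`T[1][j] == -T[1][7 - j]` up to rounding), which is one of the symmetries that fast algorithms exploit.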
II. THE CLUSTER ARCHITECTURE AND THE ALGORITHM
Here we explore four known FDCT algorithms, namely 1> Vetterli's, 2> Arai's, 3> Loeffler's and 4> Chen's. All of them are 1D FDCT algorithms except the third, which is a 2D FDCT algorithm. The clustered processor has 16 clusters with 4 processing elements (PEs) in each cluster. Each PE has two ALUs, which can operate simultaneously. The PEs operate in a SIMD manner. Data can be transferred between clusters and between PEs. [5] Each algorithm takes a row of an 8x8 matrix and computes the result using 2 clusters. All 16 clusters can therefore be used to compute the 8 rows of the DCT matrix multiplication in parallel, so that the entire computation is done in the time needed for a single row.
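The resource arithmetic of this architecture can be illustrated with a minimal sketch. The text fixes only the counts (16 clusters, 4 PEs per cluster, 2 clusters per row); the assignment of consecutive cluster pairs to rows and of processing nodes to PEs in order is an assumption made here for illustration, and the names `clusters_for_row` and `place_node` are hypothetical.

```python
NUM_CLUSTERS = 16
PES_PER_CLUSTER = 4
CLUSTERS_PER_ROW = 2  # each 8-point row is computed on two clusters

def clusters_for_row(row):
    """Assumed assignment: row r uses clusters 2r and 2r + 1."""
    first = row * CLUSTERS_PER_ROW
    return list(range(first, first + CLUSTERS_PER_ROW))

def place_node(row, node):
    """Map processing node 0..7 of a row to a (cluster, local PE) pair."""
    cluster = clusters_for_row(row)[node // PES_PER_CLUSTER]
    return cluster, node % PES_PER_CLUSTER

rows_in_parallel = NUM_CLUSTERS // CLUSTERS_PER_ROW
print(rows_in_parallel)  # 8
print(place_node(0, 5))  # (1, 1)
print(place_node(3, 2))  # (6, 2)
```

Under this assumed layout all eight rows of an 8x8 block occupy distinct cluster pairs, which is what allows them to be processed at the same time.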
III. EXECUTION METHODS OF THE ALGORITHM IN THE CLUSTER PROCESSOR
We take Vetterli's algorithm as an example to show how the algorithms can be mapped onto the assumed architecture. The entire dataflow design of Vetterli's algorithm is decomposed vertically into 6 steps. In each step of the design, the small nodes are the processing nodes. The first four processing nodes are mapped to the 4 PEs of cluster 0 and the next four processing nodes are mapped to the 4 PEs of cluster 1. The 6 steps of the dataflow design become 6 stages of the cluster processor algorithm. In each step of the dataflow design, every processing node gets one operand from the same node and the other from some other node in the design. In the cluster processor algorithm, the first operand is simply the stored result of the previous stage. The second operand is obtained by inter-PE communication or by inter-cluster communication. If the second operand comes from a PE in the same cluster, the first type of data communication is executed; otherwise the second type is executed. As the second type of data communication is more time consuming, we always try to arrange the mapping of processing nodes to PEs so as to minimize the number of inter-cluster communications. The data communication portion of the algorithm is shown below for Stage 5 of Cluster1 of Vetterli's algorithm.
The rotation operation is mapped onto the cluster processor using the following segment of the algorithm.
For PE (i)
    ALU0 <- CR0*DR1                       //alu0 = x cos Q
    TR0 <- ALU0, ALU1 <- CR1*DR1          //tr0 = x cos Q, alu1 = x sin Q
    TR1 <- ALU1, ALU0 <- CR0*DR0          //tr1 = x sin Q, alu0 = y cos Q
    ALU1 <- CR1*DR0                       //alu1 = y sin Q
    ALU0 <- ALU0-TR1                      //alu0 = y cos Q - x sin Q
    RES0 <- ALU0, ALU1 <- TR0+ALU1        //RES0 = Y
    RES1 <- ALU1                          //RES1 = X
End For
To further understand how this mapping procedure is done, we take two approaches:
A. The sixth processing node of the dataflow design is taken as PE (5) 
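The rotation segment given earlier in this section can be emulated step by step to confirm what it computes. This is a sketch only: the register contents (CR0 = cos Q, CR1 = sin Q, DR0 = y, DR1 = x) are inferred from the comments in the listing, and the function name `rotate_on_pe` is hypothetical. Each Python line corresponds to one line of the segment, with the two ALUs updated together through tuple assignment.

```python
import math

def rotate_on_pe(x, y, q):
    """Emulate the rotation segment on a single PE; returns (X, Y)."""
    cr0, cr1 = math.cos(q), math.sin(q)  # constant registers (assumed contents)
    dr0, dr1 = y, x                      # data registers (assumed contents)

    alu0 = cr0 * dr1                     # alu0 = x cos Q
    tr0, alu1 = alu0, cr1 * dr1          # tr0 = x cos Q, alu1 = x sin Q
    tr1, alu0 = alu1, cr0 * dr0          # tr1 = x sin Q, alu0 = y cos Q
    alu1 = cr1 * dr0                     # alu1 = y sin Q
    alu0 = alu0 - tr1                    # alu0 = y cos Q - x sin Q
    res0, alu1 = alu0, tr0 + alu1        # res0 = Y, alu1 = x cos Q + y sin Q
    res1 = alu1                          # res1 = X
    return res1, res0

X, Y = rotate_on_pe(3.0, 4.0, math.pi / 6)
print(X, Y)
```

Read this way, the segment produces X = x cos Q + y sin Q and Y = y cos Q - x sin Q in seven steps using four multiplications, and the rotation preserves the length of the vector (x, y).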
IV. OUR PROPOSED ALGORITHM: PARALLELDCT
It is assumed that the constants of the DCT matrix are represented as follows:
A = 0.3536, B = 0.4904, C = 0.4157, D = 0.2778, E = 0.0975, F = 0.4619, G = 0.1913
Further, it is assumed that the sums of the input matrix elements are represented as follows:
$S_1 = X_{0i} + X_{1i} + X_{2i} + X_{3i} + X_{4i} + X_{5i} + X_{6i} + X_{7i}$
$S_2 = X_{0i} - X_{1i} - X_{2i} + X_{3i} + X_{4i} - X_{5i} - X_{6i} + X_{7i}$
$S_3 = X_{0i} - X_{7i}$
$S_4 = X_{1i} - X_{6i}$
$S_5 = X_{2i} - X_{5i}$
$S_6 = X_{3i} - X_{4i}$
$S_7 = X_{0i} - X_{3i} - X_{4i} + X_{7i}$
$S_8 = X_{1i} - X_{2i} - X_{5i} + X_{6i}$
Replacing these variables in the original 8x8 DCT equation gives, for i = 0 to 7:
$Y_{0i} = S_1 A$
$Y_{1i} = S_3 B + S_4 C + S_5 D + S_6 E$
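The substitution can be checked numerically with a short sketch. Only the expressions for Y0 and Y1 appear in the text; the expressions for Y2 to Y7 below are derived here from the DCT definition using the same constants and sums, and should be read as an assumption about the missing equations rather than as the authors' own listing. The function names are hypothetical.

```python
import math

A, B, C, D, E, F, G = 0.3536, 0.4904, 0.4157, 0.2778, 0.0975, 0.4619, 0.1913

def parallel_dct_column(x):
    """8-point DCT of one input column using the sums S1..S8."""
    x0, x1, x2, x3, x4, x5, x6, x7 = x
    s1 = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7
    s2 = x0 - x1 - x2 + x3 + x4 - x5 - x6 + x7
    s3 = x0 - x7
    s4 = x1 - x6
    s5 = x2 - x5
    s6 = x3 - x4
    s7 = x0 - x3 - x4 + x7
    s8 = x1 - x2 - x5 + x6
    return [
        s1 * A,                              # Y0 (given in the text)
        s3 * B + s4 * C + s5 * D + s6 * E,   # Y1 (given in the text)
        s7 * F + s8 * G,                     # Y2 (derived, not from the text)
        s3 * C - s4 * E - s5 * B - s6 * D,   # Y3 (derived)
        s2 * A,                              # Y4 (derived)
        s3 * D - s4 * B + s5 * E + s6 * C,   # Y5 (derived)
        s7 * G - s8 * F,                     # Y6 (derived)
        s3 * E - s4 * D + s5 * C - s6 * B,   # Y7 (derived)
    ]

def reference_dct_column(x):
    """Direct evaluation of the orthonormal DCT-II for comparison."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[j] * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
            for j in range(n)
        ))
    return out

column = [52, 55, 61, 66, 70, 61, 64, 73]
fast = parallel_dct_column(column)
ref = reference_dct_column(column)
print(max(abs(a - b) for a, b in zip(fast, ref)))
```

The two results agree up to the rounding of the constants to four decimal places. As written, the sketch uses 22 multiplications per column, and the eight outputs depend only on S1 to S8, so they can be evaluated independently of one another.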
We do not exploit the symmetries any further to reduce the number of additions and subtractions. All the PEs have local memory, so adding a set of data from local memory is less complex than inter-cluster or intra-cluster data communication, and each of these takes the same time, namely one clock pulse. [5]
[Figure: dataflow of the proposed algorithm (Cluster0)]
From the findings it is clear that the new, simple algorithm for computing the DCT performs better than the others because of the high degree of parallelism inherent in its operation. The percentage utilization of the cluster processor is also significantly higher than for the other algorithms. In trying to minimize the number of multiplications, the other algorithms actually achieve a lower percentage utilization of the processor.
