Vector quantization algorithms based on fuzzy clustering have been widely used in data compression because applying fuzzy clustering analysis in the early stages of a vector quantization process makes the process less sensitive to its initialization. However, fuzzy clustering is computationally very intensive because of its complex framework for the quantitative formulation of the uncertainty involved in the training vector space. To overcome this computational burden, this paper introduces an array architecture for the implementation of fuzzy vector quantization (FVQ). The array architecture, which consists of 4,096 processing elements (PEs), provides a computationally efficient solution by employing an effective vector assignment strategy during the clustering process.
Introduction
Vector quantization (VQ) is a classical quantization technique that allows the modeling of probability density functions by the distribution of prototype vectors. VQ identifies an input vector with a member of a codebook, which is a collection of codeword vectors. The encoding process replaces each constituent input block with its corresponding VQ codeword index. However, the quality of a vector quantizer designed with the traditional VQ method is sensitive to its initialization. Using fuzzy clustering analysis in the early stages of the VQ process can make the process less sensitive to initialization. The drawback is that fuzzy clustering is computationally very intensive because of its sophisticated framework for the quantitative formulation of the uncertainty involved in a training vector space.
To overcome the computational burden of this process, this paper introduces a parallel implementation of the fuzzy vector quantization (FVQ) algorithm using a representative data parallel architecture that consists of 4,096 processing elements (PEs). The proposed parallel approach provides a computationally efficient solution by employing an effective vector assignment strategy for the transition from soft to crisp decisions during the clustering process. This paper evaluates the impact of the parallel FVQ implementation on both processing performance and energy efficiency. In addition, this paper compares the proposed parallel implementation to other implementations using commercial processors. Experimental results show that the parallel implementation provides about 1000x greater performance and 100x higher energy efficiency than implementations using commercial ARM [7] and TI DSP [9] processors.
The rest of this paper is organized as follows. Section II presents fuzzy-clustering-based vector quantization for data compression, a brief introduction to the baseline data parallel architecture, and the methodology infrastructure for the parallel FVQ implementation. Section III introduces our proposed parallel FVQ implementation using the specified data parallel architecture. Section IV analyzes and compares the performance and energy efficiency of the parallel implementation and the implementations using ARM families. Section V concludes this paper.
Background Information

Fuzzy Vector Quantization
This section briefly reviews key features of vector quantization (VQ) and fuzzy clustering [8]. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of pixel intensities, where $n$ is the number of image pixels whose memberships are to be determined.
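FCM clustering partitions the data set $X$ into $c$ clusters. The objective function of the standard FCM is defined as follows:
$$J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, \lVert x_k - v_i \rVert^2, \qquad \sum_{i=1}^{c} u_{ik} = 1,$$
where $m > 1$ is the degree of fuzzification. The degree to which the data point $x_k$ belongs to the cluster with center $v_i$ is given by the membership value $u_{ik}$. Local minimization of the objective function $J_m(U, V)$ is accomplished by repeatedly adjusting the values of $u_{ik}$ and $v_i$ according to the following equations:
$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{\lVert x_k - v_i \rVert}{\lVert x_k - v_j \rVert} \right)^{2/(m-1)} \right]^{-1}, \qquad v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m \, x_k}{\sum_{k=1}^{n} (u_{ik})^m}.$$
As $J_m$ is iteratively minimized, $v_i$ becomes more stable.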
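Iteration of pixel clustering is terminated when the termination measurement
$$\max_{1 \le i \le c} \lVert v_i^{(t)} - v_i^{(t-1)} \rVert < \varepsilon$$
is satisfied, where $v_i^{(t)}$ are the new centers, $v_i^{(t-1)}$ are the previous centers, and $\varepsilon$ is the predefined termination threshold.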
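The iteration described above can be sketched sequentially as follows. This is a minimal illustration of standard FCM on scalar pixel intensities, not the PE array implementation evaluated in this paper. The function name `fcm`, the evenly spaced initial centers, the iteration cap `max_iter`, and the use of the largest center movement as the termination measurement are assumptions made for the example.

```python
def fcm(pixels, c, m=2.0, eps=1e-4, max_iter=100):
    """Fuzzy c-means on scalar pixel intensities (illustrative sketch)."""
    lo, hi = min(pixels), max(pixels)
    # Assumed initialization: centers spread evenly over the intensity range.
    centers = [lo + (hi - lo) * (i + 0.5) / c for i in range(c)]
    n = len(pixels)
    u = [[0.0] * n for _ in range(c)]
    for _ in range(max_iter):
        # Membership update: u[i][k] for cluster i and pixel k.
        u = [[0.0] * n for _ in range(c)]
        for k, x in enumerate(pixels):
            d = [abs(x - v) for v in centers]
            if 0.0 in d:
                # Pixel coincides with a center: assign full membership to it.
                u[d.index(0.0)][k] = 1.0
            else:
                for i in range(c):
                    u[i][k] = 1.0 / sum(
                        (d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c)
                    )
        # Center update: membership-weighted mean of the pixels.
        new_centers = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            total = sum(w)
            if total == 0.0:
                new_centers.append(centers[i])  # empty cluster: keep old center
            else:
                new_centers.append(sum(wk * x for wk, x in zip(w, pixels)) / total)
        # Termination measurement: largest center movement in this iteration.
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < eps:
            break
    # u holds the memberships computed against the centers of the last update.
    return centers, u
```

For example, `fcm([10, 12, 11, 200, 205, 198], c=2)` separates the dark and bright pixels into two clusters, and the memberships of each pixel sum to one.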
In the codebook design using FVQ, the input model consists of a set of training vectors X which are weighted with uik. In addition, the training vectors are mapped into clusters which are represented by codewords V.
After several iterations, a high-quality codebook is generated. However, this process is computationally demanding. To overcome this problem, we implement a parallel FVQ using a representative data parallel architecture that consists of 4,096 processing elements (PEs).
Data Parallel Architecture
Among many computational models available for imaging applications, single instruction multiple data (SIMD) processor arrays are promising candidates for 2-dimensional image processing algorithms since they often employ thousands of processing elements (PEs) while possibly distributing and co-locating PEs with the data I/O to minimize storage and data communication requirements.
(Figure 1) shows the baseline data parallel architecture along with its interconnection network. Once data are distributed, the processing elements (PEs) execute a set of instructions in lockstep. With 4x4 pixel sensor sub-arrays, each PE is associated with a specific portion (4x4 pixels) of an image frame, allowing streaming pixel data to be retrieved and processed locally. Each PE has a reduced instruction set computer (RISC) datapath with the following minimum characteristics:
• ALU: computes basic arithmetic and logic operations.
The GENESYS simulator [11] is used to calculate technology parameters (e.g., latency, area, power, and clock frequency) for each configuration. Finally, we combine the data (e.g., cycle times, instruction latencies, instruction counts, area, and power of the functional units) obtained from the application, architecture, and technology levels to determine execution times, area efficiency, and energy efficiency for each case.
Parallel Fuzzy C-Means Clustering Algorithm
The pseudocode of the parallel FCM algorithm is given in (Figure 3), along with a pictorial description of the FCM mechanism and of the communication patterns for a hypothetical 16-node SIMD array system. Each system node is directly interfaced to a 4x4 pixel matrix. In step 1, each node computes the distance between its input pixels and the current center to determine their membership values, the new centers, and the termination measurement value. If the termination measurement value is less than the threshold value, the current codeword is replaced with the new center value. In step 2, the 16 components of the membership values are passed to the next neighbor in the same row, the new termination measurement value is computed, and it is compared against the threshold value; the current codeword is replaced with the new center value if the termination measurement value is less than the threshold. This process is iterated until every node in the same row has been visited by the membership values.
When the row completes, the membership values are transferred vertically to the next row in step 3, and the same process is iterated along the row in step 4. A key enabling role is played by the toroidal structure of the interconnection network, which enables communication between nodes on opposite edges of the mesh.
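The communication pattern of steps 1 to 4 can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it models the hypothetical 16-node array as a 4x4 torus, tracks where each data packet travels, and ignores the arithmetic performed at each node. The name `torus_schedule` and the choice of east and south as transfer directions are assumptions made for the example.

```python
def torus_schedule(rows=4, cols=4):
    """Return, for each packet, the set of nodes it visits on a rows x cols torus."""
    # packet_at[r][c] is the id of the packet currently held by node (r, c).
    packet_at = [[r * cols + c for c in range(cols)] for r in range(rows)]
    visited = {p: set() for p in range(rows * cols)}

    def record():
        for r in range(rows):
            for c in range(cols):
                visited[packet_at[r][c]].add((r, c))

    record()
    for row_pass in range(rows):
        for _ in range(cols - 1):
            # Horizontal step: every node hands its packet to its east neighbor;
            # the last column wraps around to the first.
            packet_at = [[row[(c - 1) % cols] for c in range(cols)]
                         for row in packet_at]
            record()
        if row_pass < rows - 1:
            # Vertical step: every row hands its packets to the row below;
            # the bottom row wraps around to the top.
            packet_at = [packet_at[(r - 1) % rows] for r in range(rows)]
            record()
    return visited
```

With the wrap-around links, every packet reaches all 16 nodes using only nearest-neighbor transfers, which is the property the toroidal network provides.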
Parallel Vector Quantization Algorithm
Using the codebook generated by the FCM algorithm in Section 3.1, the parallel encoding operation of a vector quantizer is implemented. The pseudocode of the parallel encoding algorithm is given in (Figure 4). As a first step, each node computes the distortion between the input block and the codeword. In a second step, the 16 components of the input block, the distortion value, and the index of the codeword are passed to the neighbor in the same row. This process is iterated until every node in the same row has been visited by the input block. When the row completes, the data are transferred vertically to the next row, and the same process is iterated along the row.
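As a sequential reference for the encoding operation, the sketch below searches the codebook for each 16-component input block and keeps a running pair of best distortion and best index, which corresponds to the values that travel with the input block in the parallel version. The squared-error distortion measure and the names `distortion`, `encode`, and `decode` are assumptions made for the example.

```python
def distortion(block, codeword):
    """Squared-error distortion between a block and a codeword (assumed measure)."""
    return sum((b - w) ** 2 for b, w in zip(block, codeword))


def encode(blocks, codebook):
    """Replace each input block with the index of its closest codeword."""
    indices = []
    for block in blocks:
        best_index, best_dist = 0, distortion(block, codebook[0])
        for index in range(1, len(codebook)):
            d = distortion(block, codebook[index])
            if d < best_dist:
                # Keep the smaller distortion and the index that produced it.
                best_index, best_dist = index, d
        indices.append(best_index)
    return indices


def decode(indices, codebook):
    """Reconstruct blocks by looking up each index in the codebook."""
    return [list(codebook[i]) for i in indices]
```

With a two-entry codebook of an all-zero block and an all-255 block, a uniform block of intensity 10 is encoded as index 0 and a uniform block of intensity 250 as index 1.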
Simulation Results
In this section, the performance evaluation of the parallel FVQ implementation is presented. We evaluated the parallel implementation with the following parameters: (1) the degree of fuzzification m=2, (2) the termination threshold ε=0.0001, and (3) the number of codewords C=2, 4, 8, 16, 32, 64, 128, and 256 (codeword size 4x4 pixels). In addition, a cycle-accurate simulator was used to simulate and evaluate the performance of the parallel FVQ with the eight different numbers of codewords, where the parallel FVQ algorithms were developed in assembly language for the PE array system. In this study, an image size of 256x256 pixels was used. Because each PE handles 4x4 pixels, a 256x256 pixel image requires 4,096 PEs. <Table 1> summarizes the parameters of the PE array system configuration.
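The arithmetic behind this configuration can be verified with a short sketch. The PE count follows directly from the image and block sizes given above. The bit-rate function is an added illustration that assumes each 4x4 block is coded as one fixed-length codeword index and that the codebook itself is not counted; the names `num_pes` and `bits_per_pixel` are hypothetical.

```python
import math


def num_pes(width, height, block=4):
    """Number of PEs when each PE holds one block x block pixel tile."""
    return (width // block) * (height // block)


def bits_per_pixel(num_codewords, block=4):
    """Index bits per pixel, assuming one fixed-length index per block."""
    return math.log2(num_codewords) / (block * block)


print(num_pes(256, 256))  # 4096
for c in (2, 4, 8, 16, 32, 64, 128, 256):
    print(c, bits_per_pixel(c))
```

Under these assumptions, the codebook sizes studied here correspond to rates from 0.0625 bits per pixel (C=2) to 0.5 bits per pixel (C=256).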
The metrics of execution time, sustained throughput, and energy efficiency form the basis of the comparison and are defined in <Table 2>, where $C$ is the cycle count, $f_{ck}$ is the clock frequency, $O_{exec}$ is the number of executed operations, $U$ is the PE utilization, and $N_{PE}$ is the number of processing elements.
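Since the definitions of <Table 2> are not reproduced in the text, the sketch below shows one common way such metrics are computed from the quantities listed above. The formulas are assumptions, not the definitions used in this paper: execution time as cycle count divided by clock frequency, sustained throughput as useful operations per second across all PEs, and energy efficiency as throughput divided by power. All function names and input values are hypothetical.

```python
def execution_time(cycles, f_ck_hz):
    """Assumed definition: t_exec = C / f_ck, in seconds."""
    return cycles / f_ck_hz


def sustained_throughput(o_exec, utilization, n_pe, t_exec):
    """Assumed definition: (O_exec * U * N_PE) / t_exec, in operations per second."""
    return o_exec * utilization * n_pe / t_exec


def energy_efficiency(throughput_ops_per_s, power_w):
    """Assumed definition: throughput / power, in operations per joule."""
    return throughput_ops_per_s / power_w


# Hypothetical inputs chosen only to exercise the formulas.
t = execution_time(cycles=1_000_000, f_ck_hz=100e6)
thr = sustained_throughput(o_exec=800_000, utilization=0.5, n_pe=4096, t_exec=t)
eff = energy_efficiency(thr, power_w=2.0)
print(t, thr, eff)
```

Under these assumed definitions, a higher PE utilization raises sustained throughput directly, and energy efficiency improves whenever throughput grows faster than system power.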
<Table 3> summarizes the performance of the FCM algorithm for generating different numbers of codewords and of the VQ encoding process in terms of execution time, system utilization, sustained throughput, PSNR, and compression rate, where the system utilization is calculated as the average number of active processing elements. The parallel implementation also achieves higher energy efficiency than the commercial processors. This is because SIMD130 achieves higher sustained throughputs with a small increase in the system power. Increasing energy efficiency improves sustainable battery life for given system capabilities. Moreover, our parallel implementation provides 1000x better performance than the ARM processors.
Performance Comparison with other Array Systems
(Figure 7) graphically shows the throughput and energy efficiency for each case.
Conclusion
In this paper, we have presented and evaluated the impact of our proposed parallel FVQ implementation using a specified data parallel architecture in terms of performance and energy efficiency. Experimental results show that the proposed parallel implementation provides greater performance and efficiency than appropriately scaled alternative parallel systems. In addition, the proposed parallel implementation provides 1000x greater performance and 100x higher energy efficiency than other implementations using today's ARM and TI DSP processors with the same 130nm technology. These results suggest that the proposed parallel implementation has the potential to improve both performance and energy efficiency.
