Currently, performance analysis of multimedia-MPSoC platforms largely relies on simulation. The execution of one or more applications on such a platform is simulated for a library of test video clips. If all specified performance constraints are satisfied for this library, then the architecture is assumed to be well-designed. This is similar to testing software for functional correctness. However, in contrast to functional testing, simulating a set of video clips for a complex application/architecture is extremely time-consuming. In this paper we propose a technique for clustering a library of video clips, such that it is sufficient to simulate only one clip from each cluster rather than the entire library. Our clustering is scalable, i.e., the number of clusters may be determined based on the number of clips that the system designer wishes to simulate (which is independent of the input library size). For each video clip in the library, we perform a fast bitstream analysis from which the workload generated while processing this clip on the given architecture may be estimated. This workload information, in conjunction with a workload model and a performance model of the architecture, is used for the clustering. This entire process does not involve any simulation and is hence extremely fast. We illustrate its utility through a detailed case study using an MPEG-2 decoder application running on an MPSoC platform. As part of the validation of our methodology, it was observed that video clips falling into the same cluster exhibit similar worst case buffer backlogs and worst case delays for one macroblock. Overall, the results demonstrate that the proposed method provides a very fast and accurate analysis and hence can be of significant benefit to the system designer.
INTRODUCTION
Simulation-based system performance analysis is a very widely adopted methodology for multimedia-MPSoC platforms. In the context of a video processing application such as an MPEG-2 decoder, these simulations take a library of test video clips as input. When simulated with this library, the MPSoC platform is considered to be appropriately designed if it behaves in accordance with all the performance constraints. This is analogous to the common software functional testing methodology. However, unlike in the software testing scenario, the simulation of an MPEG-2 decoder application with the library of video clips is very expensive with respect to time. As mentioned in an earlier work [13], it may take tens of hours to simulate only a few minutes of video in a decoding application. This is mainly due to the heterogeneous and complex nature of multimedia MPSoC architectures like the Eclipse template from Philips [12]. Therefore, the performance analysis time for such architectures increases steeply with the input library size.
In order to reduce the performance analysis time, there have been many efforts in the past [2, 6, 7] to identify representative test inputs. They classify the test inputs into well defined subsets with minimum correlation. However, many of these works were in the area of microprocessor design and the test input characteristics used for classification were instructions per cycle (IPC), cache miss rates, branch misprediction rates etc. A detailed description of the related work will be presented in the next section. Next, we highlight our contributions in speeding up the performance analysis of multimedia-MPSoC platforms.
Our contributions: A fast performance analysis of multimedia-MPSoC platforms requires a different approach from the naive method of simulating the multimedia application with all the video clips on simplescalar to find the task workloads. The three major contributions of our work are:
1. We estimate the workload of the various tasks using bitstream analysis (which avoids full decoding) and then classify the video clips based on this workload.
2. A fine-grained approach is used in choosing the VCCs (for classification) relevant to each stage in the architecture.
3. A new model for the IDCT workload was developed.
A previous work [10] introduced the concept of variability characterization curves (VCCs), where each video clip is represented by its VCCs. VCCs were also suggested in [3] as appropriate for identifying different application scenarios. The intuition behind using VCCs as the performance model is the hypothesis that video clips with similar VCCs exhibit similar maximum buffer backlogs and maximum delays for one macroblock. However, in order to compute the VCCs, we first need to compute the workload values for each task. A straightforward way to compute these workload values uses time-consuming simplescalar simulations. In this work, motivated by a workload model for MPEG-2 decoder tasks presented in [14], we propose a fast model-based performance analysis method which integrates our workload model of the decoder tasks with a performance model (using VCCs) of the MPSoC architecture, thereby providing a fast and efficient clustering of the video clips. Here, simplescalar simulations to obtain workload values for each task are completely avoided and bitstream analysis (incorporating our MPEG-2 workload model) is used instead. Consequently, system simulations can be run with only one video clip from each cluster, thereby considerably reducing the total simulation time. In addition, we also perform a fine-grained classification of video clips in each stage of the MPSoC architecture for an MPEG-2 decoder. This provides a way to identify the VCCs relevant to each stage of the architecture.

Organization of the paper: The related work on workload classification is discussed in Section 2. We present the overview of our framework in Section 3. Section 4 describes the performance model based on variability characterization curves. Here, we highlight the significance of using VCCs in modeling multimedia variability. In Section 5, we present the MPEG-2 decoder workload model. This is then used by a fast bitstream analysis to compute the task workloads, which in turn are used to generate the VCCs. The test case classification method based on VCCs is described in Section 6, which includes the experimental framework. The validation process for our methodology is explained in Section 7, where our hypothesis is experimentally established. Section 8 presents the concluding remarks.
RELATED WORK
The concept of representative workloads, in order to reduce the number of test inputs, has been comprehensively studied in the area of microprocessor design. Some of these have dealt with classifying program-input pairs based on microarchitecture dependent characteristics [2, 7] . The microarchitecture dependent program characteristics typically used were instructions per cycle (IPC), cache miss rates, branch misprediction rates and many other such characteristics. There has been some work performed to identify representative workloads based on microarchitecture independent characteristics such as register traffic, working-set size, data stream strides and instruction-level parallelism [6] . These are not instruction set architecture (ISA) or compiler independent. However, in the context of a multimedia MPSoC architecture, the characteristics used in the microprocessor domain do not capture the variabilities inherent in the test inputs (for example MPEG-2 video clips).
As there are many program input characteristics, most of the papers classify them using Principal Component Analysis (PCA). This removes the correlation among the characteristics and thereby yields a smaller subset of inputs which have minimum correlation. Eeckhout et al. [2] suggest the need to select a representative workload for a target domain of a microprocessor. They mainly propose a method of selecting the benchmarks and the input data per benchmark as representative workloads. Selecting a large number of them prohibitively increases the simulation time, as they consist of many instructions. The authors used statistical analysis techniques like PCA and cluster analysis to extract representative workloads from the entire workload space. This was done by measuring the similarity in behavior of the programs and finally establishing that programs which are close in the workload space have similar behavior. To elaborate on the PCA method, the workloads are initially characterized in an s-dimensional space, where s represents the number of program characteristics that influence the performance. As s is too large and as there is some correlation among the s characteristics, the s-dimensional workload space is reduced to a p-dimensional space such that p << s.
PCA [8] is used to transform the s characteristics $X_1, X_2, \ldots, X_s$ into s principal components $Z_1, Z_2, \ldots, Z_s$ (which are linear combinations of the original variables such that the principal components are uncorrelated), such that

$$Z_i = \sum_{j=1}^{s} a_{ij} X_j .$$

The total variance remains the same after the transformation, but some principal components have a large variance while some have a small variance. The ones which have smaller variances can be eliminated without much loss of information. This reduces the workload space into a p-dimensional space with p principal components. In this p-dimensional space, it is seen that different benchmarks will be far away from each other while the inputs from a benchmark are clustered together. Strong clustering indicates that one or a few inputs can be used to represent the cluster, while weak clustering might require the selection of many inputs. This concept led to our intuition that video clips with similar VCCs and clustered together will exhibit similar performance characteristics.
Cluster analysis is a method to group n program-input pairs depending on the values of s workload characteristics. The hierarchical clustering algorithm starts by considering each program-input pair as one cluster. It also maintains an n × n matrix of the distances between each program-input pair and every other pair. Each iteration groups the two clusters having the shortest linkage distance into a new cluster. This continues until a single cluster containing all the program-input pairs remains. Different distance measurements are used in the literature. A dendrogram is used to graphically represent the linkage distance between two clusters that are grouped together in one iteration. Another clustering method used is k-means clustering [9].
John et al. [7] propose the characterization of workloads based on an application's intrinsic properties such as memory access behavior, locality, control flow behavior, instruction-level parallelism, etc., which helps in the formulation of a program behavior model. This can then be used in conjunction with a processor model for analytical performance modeling. A study of memory reference locality using some generic metrics was also proposed. The measures used were the inter-reference temporal density function and the inter-reference spatial density function. The inter-reference temporal density function $f_T(x)$ is the probability of having x unique references between successive references to the same item. Similarly, the inter-reference spatial density function $f_S(x)$ is the probability of a reference to a location x units away between references to the location of origin. For the reasons already mentioned, multimedia workload characterization using properties like memory access behavior, locality, control flow behavior, instruction-level parallelism, etc. will not work well for multimedia MPSoC performance analysis, as these properties do not capture the burstiness and variability in multimedia workloads.

[Figure 1: Overview of the performance analysis framework. Labels: MPEG-2 video clips, Step 1, Step 2, Step 3, Video clips represented]
Characterization of video stream inputs is somewhat different from the workload characterization in the microprocessor domain. It needs a platform independent approach to identify scenarios across the media streams. Hamers et al. [5] use such a method for resource prediction in media stream applications. The approach proposes to use macroblock profiling to group frames with identical decode complexity from various streams into scenarios. The resources that were predicted for evaluation were the decode time, quality of service and energy consumption. However, extraction of these parameters to group frames takes more time than our method which groups video clips using VCCs.
OVERVIEW OF OUR FRAMEWORK
A schematic overview of our performance analysis framework is shown in Figure 1. Given a library of video clips, we perform the following steps in order to classify them into clusters:

1. We perform bitstream analysis of each clip in accordance with our workload model, extracting the cycle requirements and the bits per macroblock.
2. The two parameters (cycle requirements and bits per macroblock) extracted in Step 1 are then used to derive the corresponding VCCs in accordance with our performance model.
3. These VCCs are first used to transform the video clips into the VCC space. Then a hierarchical clustering of the video clips is performed based on a distance measurement between the clips in the VCC space. As a result, it is then possible to use one video clip from each cluster and perform simulations. The system designer can control the number of required clusters.

[Figure 2: MPSoC platform architecture. Labels: Network Interface, MPEG-2 encoded macroblocks, partially decoded macroblocks, completely decoded macroblocks]
Of the three steps discussed above, bitstream analysis is the only step which is specific to the codec. This is because we have to develop a workload model of the tasks for each codec, which is then used during bitstream analysis. However, once the workload model is developed, bitstream analysis can be performed quickly, as the video is not completely decoded. The VCC generation and clustering steps are codec independent.
The MPSoC platform architecture used for a case study of the MPEG-2 decoder application consists of multiple interconnected processing elements (PEs) as shown in Figure 2. The tasks are split and efficiently allocated to the PEs. The PEs communicate by passing data units or stream objects between them. PE1 and PE2 are the two programmable processors. The platform also contains the input/network interface and the output interface. After mapping the MPEG-2 decoder application onto the MPSoC platform, PE1 performs the Variable Length Decoding (VLD) and Inverse Quantization (IQ) tasks, while PE2 performs the Inverse Discrete Cosine Transform (IDCT) and Motion Compensation (MC) tasks. The stream objects on which the PEs operate are macroblocks (MBs). Partially decoded MBs are sent from PE1 to PE2 through buffer B2, while fully processed MBs are sent out of PE2 to the output interface through buffer B3.
VARIABILITY CHARACTERIZATION CURVES
The hypothesis of this work is that video clips with similar VCCs cluster together in the VCC space and exhibit similar performance characteristics namely worst case buffer backlog and worst case delay for one MB. There is a strong indication that this is true because VCCs accurately characterize the data-dependent variability in the (i) execution times and (ii) input-output rates of the multimedia processing tasks. This process of quantitatively modeling the input stream variability constitutes our performance model. The burstiness in the arrival of streams can also be characterized using this method. These factors collectively contribute to the values of the performance characteristics.
VCCs specify the best-case and worst-case quantities of a variable characteristic with respect to an input parameter. This parameter can refer to sequences of consecutive executions of a task or to sequences of consecutive time intervals of some specified length. A VCC is composed of a tuple $(\nu^l(k), \nu^u(k))$, where k is the input parameter representing the length of a sequence. $\nu^l(k)$ represents the lower bound on some characteristic that holds for all subsequences of length k within some larger sequence. $\nu^u(k)$ is the corresponding upper bound. More specifically, if $P(n)$ denotes the measure of a property for the first n items in the sequence, then

$$\nu^l(k) \le P(i + k) - P(i) \le \nu^u(k) \quad \text{for all } i \ge 0,\ k \ge 1. \tag{1}$$

Based on the above definition of a VCC, a workload VCC $[\gamma^u(k), \gamma^l(k)]$ can be defined as the execution requirement bounds for a task mapped onto a PE, in terms of the number of processor cycles for any k consecutive MBs. In other words, if $W(k)$ represents the number of processor cycles required by a task for the first k MBs in the video stream, then we can define, for any i,

$$\gamma^l(k) \le W(i + k) - W(i) \le \gamma^u(k). \tag{2}$$

Similarly, consumption and production VCCs can be defined. The consumption VCCs are the bounds on the number of activations of a task for any k consecutive stream objects. Likewise, the production VCCs are the bounds on the number of stream objects produced by k consecutive activations of a task. It is hypothesized that video streams having similar VCCs will have similar worst/best case behaviors (e.g., maximum backlogs in buffers).
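The workload VCC definition can be illustrated with a minimal Python sketch. It assumes the tightest curves are wanted, i.e. the maximum and minimum of $W(i+k) - W(i)$ over all window positions i; the function name `workload_vcc` and the toy cycle counts are hypothetical and not taken from the paper.

```python
from itertools import accumulate


def workload_vcc(cycles_per_mb, max_k=None):
    """Return (upper, lower) workload VCCs for one task.

    cycles_per_mb[i] is the cycle demand of macroblock i.
    upper[k-1] and lower[k-1] bound the cycles of any k consecutive MBs.
    """
    n = len(cycles_per_mb)
    max_k = n if max_k is None else min(max_k, n)
    # prefix[i] = W(i): cycles needed by the first i macroblocks, W(0) = 0
    prefix = [0] + list(accumulate(cycles_per_mb))
    upper, lower = [], []
    for k in range(1, max_k + 1):
        # Cycle demand of every window of k consecutive macroblocks
        windows = [prefix[i + k] - prefix[i] for i in range(n - k + 1)]
        upper.append(max(windows))
        lower.append(min(windows))
    return upper, lower


# Toy per-macroblock cycle counts (illustrative values only)
upper, lower = workload_vcc([4, 10, 2, 8])
print(upper)  # [10, 14, 20, 24]
print(lower)  # [2, 10, 16, 24]
```

This direct evaluation costs O(n · max_k) operations per clip, which is usually acceptable when the analysis interval is much shorter than the clip.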
An important aspect of VCCs, and one that works in their favour, is that they provide a more realistic model for estimating the resource requirements on a platform. Let us analyze this property. Let us denote the maximum execution cycle requirement for a single MB as $e_{max}$ and the minimum execution cycle requirement for a single MB in the same video clip as $e_{min}$. We can obtain the worst case execution time $k \times e_{max}$ and the best case execution time $k \times e_{min}$ for k consecutive MBs by linear interpolation of the corresponding execution times for one MB. Further, let us denote the upper and lower workload VCCs of this task for k consecutive MBs by $\gamma^u(k)$ and $\gamma^l(k)$, respectively. It can be proved from the definition of a VCC that, for k consecutive macroblocks in a video stream,

$$k \times e_{min} \le \gamma^l(k) \le \gamma^u(k) \le k \times e_{max}. \tag{3}$$

The above equation is shown graphically in Figure 3. The differences $\delta^u(k)$ and $\delta^l(k)$ shown in Figure 3 are defined as

$$\delta^u(k) = k \times e_{max} - \gamma^u(k), \qquad \delta^l(k) = \gamma^l(k) - k \times e_{min}. \tag{4}$$
These differences show how much a worst case estimate and a best case estimate deviate from a more realistic estimation using VCCs. Hence, the performance model using VCCs does not take the extreme resource requirements for a task. At the same time, it does not under estimate the resource requirement for a task as is observed when the linear interpolation of the best case execution time is used. A more realistic estimate using VCCs makes sure that the MPSoC 
MPEG-2 DECODER WORKLOAD MODEL
The major tasks involved in MPEG-2 decoding are VLD, IDCT and MC. The computational workload required for other tasks such as IQ is negligible. The MPEG-2 decoder workload model depicts the computational workload (at MB granularity) required for each of the major tasks in MPEG-2 decoding. The workload model was developed for a RISC processor (similar to a MIPS3000) without any MPEG specific instructions. The MPEG-2 decoder application used for simulations was Test Model 5 (TM5) [1] . The simulations here refer to the simulations required for one time development of the workload model of the decoder tasks that are mapped onto the MPSoC platform and not simulations to obtain task workload values of every new video clip added to the input library. This workload model is employed in Step 1 of our performance analysis framework shown in Figure 1 to extract workload and arrival rate information.
VLD Task
It was experimentally found that the processor workload depends on the length of the Huffman codes, which implies that the workload for VLD depends on the number of non-zero IDCT coefficients. The simulations showed this relation and in fact established that it is linear. Hence, the processor workload for the VLD task at MB granularity is modeled as:

$$Workload_{vld} = a \times n_{coeff} + b \tag{5}$$

where $Workload_{vld}$ is the estimated number of processor cycles for VLD decoding of the MB, $n_{coeff}$ is the number of non-zero coefficients in the MB, and a and b are constants that depend mainly on the processor architecture. This straight line fit for the VLD workload is supported by a plot of the number of processor cycles required (from simplescalar simulation) versus the number of non-zero coefficients obtained for a video clip. This is shown in Figure 4. From simulations, the values of a and b for the above mentioned processor were fixed at 140 and 3000. The VLD workloads obtained for 50 macroblocks of 4 video clips (from Table 1) using the workload model based on Equation (5) and simplescalar simulation using the ffmpeg open source decoder code are plotted in Figures 5(a) and 5(b). It is observed from the two graphs that, although the VLD workload model was derived by instrumenting a different source code, both graphs are very similar, exhibiting nearly identical characteristics for VLD processing. This demonstrates the validity of the VLD workload model.
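The linear VLD model can be sketched in Python as follows. The sketch assumes that a = 140 is the slope (cycles per non-zero coefficient) and b = 3000 is the constant term, following the order in which the two values are quoted; the helper `fit_line` is a hypothetical illustration of how such constants could be obtained from (coefficient count, cycle count) samples by least squares, and is not the authors' fitting procedure.

```python
A_VLD = 140    # assumed slope: cycles per non-zero coefficient
B_VLD = 3000   # assumed intercept: fixed cycles per macroblock


def vld_workload(n_coeff, a=A_VLD, b=B_VLD):
    """Estimated VLD cycles for one macroblock with n_coeff non-zero coefficients."""
    if n_coeff < 0:
        raise ValueError("n_coeff must be non-negative")
    return a * n_coeff + b


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = a * xs + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Synthetic samples that lie exactly on the assumed line
coeff_counts = [0, 10, 20, 40]
cycles = [vld_workload(n) for n in coeff_counts]
print(cycles)                          # [3000, 4400, 5800, 8600]
print(fit_line(coeff_counts, cycles))  # approximately (140.0, 3000.0)
```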
MC Task
MC is another expensive task in MPEG-2 decoding. There are three types of MBs in an MPEG-2 bitstream, namely I-type (do not require motion compensation), P-type (require only forward motion compensation) and B-type (require both forward and backward motion compensation). Hence, it was intuitively concluded that P-type MBs require half as many processor cycles as B-type MBs, while I-type MBs do not consume processor cycles for MC. However, this rough prediction does not suffice for MC. There are other parameters on which an MC function depends such as 
IDCT Task
We estimate the IDCT workload requirement for each MB in a video clip based on the position of the IDCT coefficients in the 8x8 block structure in the MB. The MPEG-2 stream that was used to run the experiments had the 4:1:1 chroma format. This implies that each MB had 6 blocks with 64 IDCT coefficients each. The workload requirements for these MBs vary with two types of frame formats, namely Intra-Frames and Inter-Frames. The number of non-zero IDCT coefficients in significant positions of the 8x8 block was extracted and then used to estimate the workload requirement for each block. Here, significant positions are those positions which are the main contributors to the workload values in the IDCT task. Let the number of non-zero IDCT coefficients in the significant positions be $n_{idct}$. This value can be negative if the number of zero IDCT coefficients in certain positions exceeds the number of non-zero IDCT coefficients in other significant positions. Then the IDCT workload estimate for each MB can be calculated as

$$Workload_{idct} = W_{basis} + \alpha \times n_{idct} \tag{6}$$

where $W_{basis}$ is the base workload value, that is, the minimum required workload if there is at least one non-zero IDCT coefficient in a significant position. It varies depending on whether the frame type is Intra-Frame or Inter-Frame, the values being 10782 for Intra-Frame MBs and a linear combination of the values 374, 1863 and 1981 for Inter-Frame MBs. The value of α has been found to be 118. Hence, we did not require a look-up table (LUT). The IDCT workloads obtained for 50 macroblocks using the workload model based on Equation (6) and simplescalar simulation using the ffmpeg open source decoder code are plotted in Figures 5(e) and 5(f). It is observed from the two graphs that the IDCT workload model exhibits workload requirements similar to those obtained using simulation. This demonstrates the validity of the IDCT workload model.
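A minimal Python sketch of the IDCT estimate follows, assuming the model has the linear form base workload plus α times the signed coefficient count. Only the Intra-Frame base value (10782) and α = 118 are taken from the text; the Inter-Frame base value is left to the caller because the exact linear combination of 374, 1863 and 1981 is not specified here. The function name `idct_workload` is hypothetical.

```python
ALPHA = 118            # cycles per significant non-zero coefficient (from the text)
W_BASIS_INTRA = 10782  # base workload for Intra-Frame macroblocks (from the text)


def idct_workload(n_idct, w_basis=W_BASIS_INTRA, alpha=ALPHA):
    """Estimated IDCT cycles for one macroblock.

    n_idct may be negative, as noted in the text, which lowers the
    estimate below the base workload.
    For Inter-Frame macroblocks the caller must supply w_basis.
    """
    return w_basis + alpha * n_idct


print(idct_workload(0))    # 10782
print(idct_workload(5))    # 11372
print(idct_workload(-2))   # 10546
```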
Earlier works on IDCT workload modeling were performed for workload-scalable transcoding, as in [15]. The authors use a look-up table to predict the workload values for IDCT based on whether the frame type is Inter-Frame or Intra-Frame and also based on the position of the most important non-zero IDCT coefficient. As they also considered skipped frames, they required a 3x64 LUT to predict the workload value. In [11], the number of significant non-zero IDCT coefficients is decided by an energy threshold.
Total Workload
The total workload for MPEG-2 decoding can therefore be obtained by adding up the values predicted for the VLD, MC and IDCT tasks. These workload values are now used to generate the VCCs at the various stages in the architecture.
TEST CASE CLASSIFICATION
We utilize the bitstream analysis method incorporating the workload model described in Section 5. In addition to the workload values, we also extracted macroblock sizes (in bits) of the encoded bitstream (in order to obtain arrival rate information) by just parsing through the frame structure of the video clips. The VCCs obtained from these two quantities are used to perform classification of the MPEG-2 clips shown in Table 1 . This turns out to be a faster design methodology from a system designer's perspective compared to simplescalar simulation as the time required for classification using bitstream analysis is much lower. The metrics used for performance analysis of the MPSoC architecture are worst case buffer backlog and worst case delay for one MB.
To classify two streams based on a single type of variability, a dissimilarity measure is used. The dissimilarity between two VCCs at each of the points k = 1, 2, . . . , n is found using the City Block metric [4]. The pairwise dissimilarity between two streams i and j, with respect to a VCC of type r, is then computed using

$$d_r(i, j) = \sum_{k=1}^{n} \omega_r(k)\, |\Theta_{ri}(k) - \Theta_{rj}(k)| \tag{7}$$

where $\Theta_{ri}(k)$ represents a VCC of type r associated with the i-th stream and $\omega_r(k) = 1/k$ are weights that normalize the differences $|\Theta_{ri}(k) - \Theta_{rj}(k)|$ over the length k of the analysis interval. When more than one VCC is used, the pairwise dissimilarity between the streams for each VCC is calculated using Equation (7). These values are combined to form the overall pairwise dissimilarity measure between two streams i and j with respect to all the VCC types considered:

$$d(i, j) = \sum_{r} d_r(i, j). \tag{8}$$
The overall pairwise dissimilarity measure is obtained by giving equal weightage for each VCC. The complete linkage algorithm is used to classify the streams based on the dissimilarity measure computed in Equation (8). This is Step 3 of our performance analysis framework shown in Figure 1 . A dendrogram of the hierarchical cluster tree is then obtained as a result of the classification. Next, we discuss the experimental framework that is used to validate the claim that the bitstream analysis approach actually results in proper identification of representative workloads for a MPSoC platform.
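The classification step can be sketched in plain Python as follows. The sketch assumes the weighted City Block distance with weights 1/k for each VCC type, an equal-weight sum over VCC types for the overall measure, and a straightforward complete linkage agglomeration that stops at a requested number of clusters. All function names, clip names and VCC values are hypothetical illustrations, not data from the paper.

```python
def vcc_dissimilarity(vcc_i, vcc_j):
    """Weighted City Block distance between two VCCs, with weight 1/k at point k."""
    return sum(abs(a - b) / k
               for k, (a, b) in enumerate(zip(vcc_i, vcc_j), start=1))


def overall_dissimilarity(stream_i, stream_j):
    """Equal-weight sum of the per-type dissimilarities.

    Each stream is a dict mapping a VCC type name to its list of values.
    """
    return sum(vcc_dissimilarity(stream_i[r], stream_j[r]) for r in stream_i)


def complete_linkage(streams, num_clusters):
    """Agglomerative clustering with complete linkage; returns a list of sets."""
    names = list(streams)
    dist = {frozenset((a, b)): overall_dissimilarity(streams[a], streams[b])
            for idx, a in enumerate(names) for b in names[idx + 1:]}
    clusters = [{name} for name in names]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the smallest linkage distance
        _, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] | clusters[q]
        del clusters[q]
    return clusters


# Toy upper workload VCCs for four hypothetical clips
clips = {
    "still_a": {"gamma_u": [10, 18, 25]},
    "still_b": {"gamma_u": [11, 19, 27]},
    "motion_a": {"gamma_u": [40, 75, 100]},
    "motion_b": {"gamma_u": [42, 78, 104]},
}
print(complete_linkage(clips, num_clusters=2))
```

With these toy values the two "still" clips and the two "motion" clips end up in separate clusters, and the `num_clusters` argument plays the role of the designer-chosen number of representative clips.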
Experimental Framework
Here, the concepts discussed in Sections 4 and 5 are integrated and applied to the different stages of the multiprocessor architecture as shown in Figure 2 .
The video stream is first parsed to extract the required characteristics, namely workload requirement per macroblock and bit sizes of each macroblock. For this, we use TM5 as our decoder source code in order to implement the workload model for the VLD+IQ and IDCT+MC tasks. The code to compute the workload values of different task sets mapped to each PE was inserted into the appropriate modules of TM5. The bit sizes per macroblock are also computed by keeping track of the count of bits as the procedure for decoding a macroblock is entered. The executable is then run for each of the clips used in the test set. It is interesting to note here that a certain group of clips exhibits higher variation in output workload values in comparison to other groups. A similar observation was also made for the number of bits per macroblock. This led us to the intuition that VCC curves obtained from these values can be used to classify the videos as it characterizes the bursty nature of video data and the accompanying variation in the workload requirements.
Once the bitstream analysis is performed, the next task in the process of classification is the generation of the VCCs. In this step, we produce the workload VCCs as described in Section 4, but with a variation in which VCCs are generated. As we have already obtained the workload values for the tasks VLD+IQ and IDCT+MC by bitstream analysis, we generate separate workload VCCs, each denoted by $[\gamma^u(k), \gamma^l(k)]$, for these two sets of tasks. In addition to these, we also obtain a VCC from the bits per macroblock statistics. As the input bit rate of the video clips is constant at 8 Mbps, we compute the input arrival rate of each macroblock, which is then used for the generation of the macroblock arrival rate VCC denoted by $[\kappa^u(k), \kappa^l(k)]$.

Observations: It is evident from the obtained VCCs and dendrograms that the VCCs obtained from bitstream analysis of the MPEG-2 clips provide the basic clustering into motion and still videos. The classification is specific to the different stages in the architecture, which is a more fine-grained approach than performing it for the entire architecture. This gives a more accurate classification of the video clips, as different combinations of VCCs play a decisive role in the determination of the architecture specifications at various stages. In the next section, we discuss the setup and procedure to validate the claim that bitstream analysis based generation of VCCs actually aids in the classification of workloads.
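The arithmetic behind the arrival information can be illustrated with a short Python sketch. It assumes that a macroblock is considered to have arrived once its last bit has been received at the constant input bit rate of 8 Mbps; the function name `arrival_times` and the sample macroblock sizes are hypothetical.

```python
from itertools import accumulate

BIT_RATE = 8_000_000  # constant input bit rate in bits per second (8 Mbps)


def arrival_times(bits_per_mb, bit_rate=BIT_RATE):
    """Time in seconds at which the last bit of each macroblock has arrived."""
    return [total_bits / bit_rate for total_bits in accumulate(bits_per_mb)]


# Illustrative macroblock sizes in bits
times = arrival_times([800, 1600, 400])
print(times)  # roughly [0.0001, 0.0003, 0.00035]
```

Large macroblocks therefore arrive further apart than small ones, which is the source of the burstiness that the arrival rate VCC is meant to capture.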
VALIDATION
The integral architectural parameters of the MPSoC platform shown in Figure 2 are the processor frequencies and the sizes of various buffers, namely the input buffer, the intermediate buffer and the playout buffer. In the current step, we fix a particular frequency pair corresponding to the two PEs. This selection is currently not based on any analytical framework as we are not concerned about any playout buffer underruns in this experiment. Here, we are more concerned about the various buffer occupancies and try to establish the claim that similar videos that are nearer to each other in the cluster trees shown in Figures 7(a) , 7(b) and 7(c) also exhibit similar buffer occupancies. This claim can be emphasized even more by showing that the pair of video clips which are closer than others exhibit less difference in their maximum buffer occupancies than other pairs. This provides strong evidence for the validity of the bitstream based classification of video clips. The maximum buffer size required for each video clip is computed using the equation
where i = 1, 2, 3, . . . , N (N is the number of the last macroblock in the video stream), $\tau_{curr}$ is the time instant at which the MB being serviced by the PE is completely processed, $Buf_i$ is the buffer backlog when the i-th macroblock is inserted into the system, $Buf_0 = 0$, $\tau_{arr_i}$ is the time instant when the i-th macroblock arrives and $Bufferbacklog$ is the maximum backlog in the input buffer. The interpretation of the above equation is straightforward. The buffer occupancy keeps increasing as new MBs enter the particular stage of the architecture, and it reduces as they are completely processed by the PE and sent to the next stage. The worst case delay for one MB can be computed using Equation (10), where $\tau_{mbcyc_i}$ is the processor cycle time required for the i-th macroblock. The expression for the worst case delay for one MB given by Equation (10) takes the following two cases into consideration:

1. All the previous MBs have been processed before or when the new MB arrives, in which case the delay for the arriving MB is $\tau_{mbcyc_i}$.
2. If previous MBs have still not been processed when a new MB arrives, then the processing of the new MB can start only after all the MBs ahead of it in the buffer are processed.
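The behavior described by the two cases can be reproduced with a small FIFO model in Python. This is a sketch under stated assumptions rather than a transcription of Equations (9) and (10): a single PE serves macroblocks in arrival order, the backlog is sampled at each arrival and counts every macroblock that has arrived but not finished processing (including the one in service), and the delay of a macroblock is its finish time minus its arrival time. The function name `simulate_stage` and the sample times are hypothetical.

```python
def simulate_stage(arrivals, service_times):
    """Return (max_backlog, max_delay) for one PE fed by a FIFO buffer.

    arrivals[i] is the arrival time of macroblock i (non-decreasing);
    service_times[i] is its processing time in the same time unit.
    """
    finish_times = []
    prev_finish = 0
    max_backlog = 0
    max_delay = 0
    head = 0  # oldest macroblock that is still unfinished at the current arrival
    for i, (t_arr, t_srv) in enumerate(zip(arrivals, service_times)):
        # Case 1: PE idle on arrival, start at once.
        # Case 2: wait until all macroblocks ahead are processed.
        start = max(t_arr, prev_finish)
        finish = start + t_srv
        finish_times.append(finish)
        prev_finish = finish
        max_delay = max(max_delay, finish - t_arr)
        # Macroblocks finished by this arrival instant have left the stage
        while head < i and finish_times[head] <= t_arr:
            head += 1
        max_backlog = max(max_backlog, i - head + 1)
    return max_backlog, max_delay


# Three macroblocks queue up behind a slow PE, the fourth finds it idle
print(simulate_stage([0, 1, 2, 10], [3, 3, 3, 1]))  # (3, 7)
```

In this example the fourth macroblock falls under the first case and is delayed only by its own processing time, while the third macroblock experiences the worst delay because two macroblocks are still ahead of it.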
In order to check the above mentioned validity, we have simulated the multiprocessor architecture using a SystemC simulator with the workload cycles obtained from simplescalar simulation (sim-safe configuration). The PE1 frequency was fixed at 40 MHz while the PE2 frequency was fixed at 200 MHz. The results obtained support the hypothesis we started with and are presented in Table 2.
It is immediately observed from the results that the motion and still videos that form separate clusters also give similar buffer occupancies in their respective clusters. More importantly, we can observe that pairs of video clips which have smaller linkage distances in the cluster trees exhibit similar buffer occupancies. In the case of B1, videos 4 and 8 have more similar maximum backlogs than videos 1 and 7, the differences in maximum backlogs being 11333 and 28350, respectively. For B2, videos 4 and 9 exhibit the smallest backlog difference (32009) compared to videos 4 and 8 (50575) and videos 1 and 7 (34808). Videos 1 and 6 are more similar in the playout stage than videos 4 and 5, which is also evident from the cluster tree of the playout stage. The similar worst case delays for one MB among video clips from the same cluster are also evident from the simulation results for the maximum delays for one macroblock shown in Table 3. In the case of PE1, it is seen that videos 1 and 7 have similar maximum delays, while videos 4 and 8 are closer to each other in their maximum delays. It is also seen that video clip 10 is much closer in maximum delay to video clips 1 and 7 than to video clips 4 and 8. This behavior is also seen in PE2.
We have also conducted experiments to validate that the VCC and bitstream analysis based classification methodology works irrespective of the mapping of application tasks onto the PEs. In our case study of the MPEG-2 decoder, we looked at combinations of tasks other than the one mapped onto each PE in the above exercise. Due to space constraints, we do not present the results here. These experiments indicated that the methodology is independent of the mapping of the application tasks onto the PEs.
CONCLUDING REMARKS
In this paper, we have presented a fast and efficient model-based test case generation methodology for performance analysis of multimedia MPSoC platforms. Our method completely eliminates the time-consuming simulations required to cluster the library of video clips. It also gives the system designer control over the selection of the number of representative video clips. We have validated our method in the context of an MPEG-2 decoder application running on an MPSoC architecture with two PEs. The performance metrics analyzed to establish the validity of the method were worst case buffer backlog and worst case delay for one macroblock. It would be interesting to extend this work to more complex architectures involving some microarchitectural details. Moreover, at each stage of the MPSoC architecture, all the VCCs used for classification were given equal weightage. For complex architectures, this might not be appropriate. In such cases, the weightages have to be determined in an iterative fashion. Starting with an initial weightage vector for the VCCs, the method would help in achieving an accurate clustering of video clips that adheres very closely to the performance metrics obtained from simplescalar simulations. Since similar kinds of tasks, such as VLD, DCT and motion prediction, exist in other codecs like H.264, the proposed classification methodology has a good possibility of being used for codecs other than MPEG-2 as well. This will be studied thoroughly in future work.
