Abstract-As chip multiprocessors (CMPs) have become the mainstream of processor design, parallel programming poses a new challenge for programmers. The same program may perform very differently on different multi-core architectures, and even on the same multi-core processor, different mapping strategies yield distinct performance. How can programmers determine whether their programs, which are based on specific multiprocessing architectures and mapping strategies, are efficient or even portable? In this paper, we propose Architecture-Based Trace and Evaluation (ABTE) and a corresponding framework, which intelligently helps programmers approximate the performance of their solutions without actually running them. ABTE consists of two main parts: 1) a library of architecture models and algorithms; 2) an evaluation engine. We introduce a method for describing models of various architectures and their running algorithms. Based on these models, we propose a marked object trace method to help evaluate parallel solutions, and use it in the evaluation engine. We explain ABTE through a case study, and the evaluation shows that ABTE can help programmers find the better solution to a parallel application without actually running it.
I. INTRODUCTION
Parallel programming is hard for programmers who are used to serial programming. They must change their serial way of thinking into a parallel one and learn more about architectural details. Without this change, solutions to parallel applications may waste available computing resources and even perform poorly.
Programmers are familiar with several software parallel computation models, such as PRAM [13], BSP [12], and LogP [14]. With these models, programmers can design and describe their algorithms while omitting some hardware details. Shared memory and distributed memory with message passing are the two basic concepts on which these widely used models are built. As a particular case, stream computing is also an important technology in special areas such as image processing and scientific computing.
A combination of choices among software parallel computation models, algorithms and hardware architectures is called a solution to a parallel application. Which solution performs better matters to programmers. If the performance can be approximated without actually running the solution, programmers need only focus on the solutions that better match their requirements. The features of programming and of solutions should therefore be explored for better estimation.
SWARM [15] is a parallel programming framework for multicore processors. It is a descendant of an SMP node library component for developing efficient multicore algorithms. Because its application is limited to SMP (Symmetric Multi-Processor) computing, it offers a multicore model that considers three primary issues affecting performance: the number of processing cores, caching and memory bandwidth, and synchronization. Streamware [17] offers a framework and a runtime environment that make it possible to run the stream programming model efficiently on general-purpose processors. This software system uses a fixed software model and various hardware models.
In this paper, we propose Architecture-Based Trace and Evaluation (ABTE), which can estimate the performance of a solution before it is run. It includes two parts: 1) a library of architecture models and algorithms; 2) an intelligent evaluation engine. In our framework implementing ABTE, we target not only general-purpose processors but also stream processors. Our hardware model focuses on the ability to reconfigure fine-grained memory blocks, processing elements, and interconnections, so the hardware model is scalable and flexible. We also offer a software model library that includes popular parallel programming models.
Apart from the stream programming model, there are two major parallel programming models: message passing and shared memory. The Message Passing Interface (MPI) [5] is in effect a standard for message passing libraries, and it has been implemented for many languages. SMP is a more highly integrated system in which processors share a global memory. Compared with MPI, OpenMP [6] is a better way for processors to interact with each other in a shared memory model.
The paper is partially supported by the Beijing Key Discipline Program.
Most tools and solutions are implemented in one of two ways: library-based or new-language-based. MPI and OpenMP both add a parallel library to existing languages. The new-language-based way, such as DSVMs [16] or stream languages like Stream-C/Kernel-C [8] and StreamIt [7], offers a new language for manipulating specific hardware, and can map the application directly to the architecture or API.
We integrate a programming interface into our framework and allow several kinds of parallel compilers to be added in. This makes our framework language-independent. Programmers can focus on optimizing their applications and use the languages they are familiar with.
In Section 2, we introduce the method for describing models of hardware architectures and algorithms. The software models and algorithms involved in our method are explained in Section 3, where the ABTE framework is also introduced. A method of marking objects for tracing and evaluation is proposed in Section 4, and a simple application is used as an example to explain how ABTE works. In Section 5, we use ABTE to evaluate several solutions to the application and compare the results with experimental results. We find that although ABTE is not as accurate as runtime evaluation, it can help programmers identify the correct trend as to which solution is better.
II. PARALLEL COMPUTING MODEL
In ABTE, the hardware architecture is an important factor in evaluation. We extract key features from architectures and describe them in an object-oriented pattern to form corresponding Parallel Computing Models (PCMs). A PCM is built from static information about a specific piece of parallel hardware, which supports several running modes. These running modes, which are based on the hardware features, are called PCM algorithms. For example, the IBM Power5 [10] architecture can be described as a Power5 PCM, which includes the features of its processing elements, memory, and the interconnections between them. The Power5 PCM algorithms are running modes that follow the features of Power5, such as the Power5 SMT algorithm, which describes how threads are scheduled onto the CPU and run.
In this section, we propose a method for building PCMs and PCM algorithms, and explain it by modeling TriBA [3], which is a novel multi-core architecture.
A. Model description
A PCM is composed of processing elements (PE), a memory system (MS), and the interconnections (IC) between them. In the logical view, every feature is a composite of many components, whose types may be the same or different. In the implementation, we do not actually generate that many objects. Instead, we define the position and range information that forms a block, and save a pointer to a single object stored in a component pool to represent the component type that forms the block. In Figure 1 we take the memory system as an example. We suppose that a memory system is composed of several memory blocks of two different types, which are represented as circles and squares respectively. Figure 1(a) shows the logical view. If we do not use a component pool to optimize it, 3×3+2×3+3×2=21 memory element objects are generated. Figure 1(b) shows the implementation with a component pool. In this way, only 2 memory element objects are generated. In Figure 2, 
Elements of the vertex set V of the interconnection graph (ICG) can be processors (P), memory blocks (M), switches of an on-chip network (S), or buses (B). If there is a direct path between two vertices i and j, then an edge (i, j, w) is included in the edge set E, where w is the weight of the edge. This weight represents the transfer latency between the two vertices. The order of magnitude of the latency is set from the performance of real hardware. For example, if a processor P has a private on-chip cache A and can also access an off-chip memory B, then the edges (P, A, W1) and (P, B, W2) are included in the ICG; if W1 is 3, then W2 may be 30 or even larger. This helps in evaluation and can prompt users to revise their designs to make better use of the cache and reduce accesses to remote memory. Each vertex in an ICG can itself be a smaller ICG, which lets us express the system at different granularities. The finer the granularity, the more accurate the evaluation.
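The ICG can be pictured as a small weighted graph. The following Python sketch is illustrative only: the class name `ICG`, its methods and the vertex labels are hypothetical, the edges are assumed to be undirected, and using the smallest accumulated weight as the latency between vertices that are not directly connected is an assumption rather than something the text specifies. The weights 3 and 30 are the example values W1 and W2 given above.

```python
import heapq
from collections import defaultdict


class ICG:
    """Hypothetical interconnection graph: vertices are P/M/S/B components,
    each edge (i, j, w) carries a transfer-latency weight w."""

    def __init__(self):
        self.adj = defaultdict(dict)

    def add_edge(self, i, j, w):
        # Assumption: a direct path can be used in both directions.
        self.adj[i][j] = w
        self.adj[j][i] = w

    def latency(self, src, dst):
        """Smallest accumulated weight from src to dst (Dijkstra), or None."""
        dist = {src: 0}
        heap = [(0, src)]
        while heap:
            d, v = heapq.heappop(heap)
            if v == dst:
                return d
            if d > dist.get(v, float("inf")):
                continue  # stale heap entry
            for nxt, w in self.adj[v].items():
                nd = d + w
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    heapq.heappush(heap, (nd, nxt))
        return None  # no path between the two vertices


icg = ICG()
icg.add_edge("P", "A", 3)    # private on-chip cache, W1 = 3
icg.add_edge("P", "B", 30)   # off-chip memory, W2 = 30
cache_latency = icg.latency("P", "A")
memory_latency = icg.latency("P", "B")
```

With these example weights, an access served by the cache costs one tenth of an access to the off-chip memory, which is the kind of gap the evaluation is meant to expose.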
B. PCM algorithm description
A PCM algorithm determines when and how to process or move data to complete the function of an application. It is represented as a set of objects and several relationships. As shown in Table 1, the objects are of two types, Operation and Data, each of which has several subtypes. Data objects, which are either formatted communication units (Message objects) or blocks of data (Data block objects), are passive and can only be manipulated by Operation objects. Operation objects are active: they process data (Function objects), resize data (Split/Merge objects), or direct objects from one place to another (Schedule objects). Objects may be composed of sub-objects, and each time an object is divided into sub-objects, the algorithm is described in more detail. Once the granularity of the objects is fine enough, the algorithm can be translated into executable code.
In parallel programming, two basic relationships are required: Parallel (represented by ||) and Dependent (represented by f). Let A and B be two objects. A||B means that A and B can be executed at the same time; A f B means that B can only be executed after the execution of A.
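A minimal Python sketch of how these two relationships constrain execution time is given here. It is not the representation used by ABTE: the function `finish_times`, the object names and the durations are all hypothetical. Dependent pairs are stored as predecessor sets, and any two objects with no dependency path between them are treated as Parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def finish_times(durations, depends_on):
    """durations: {object: latency}.
    depends_on: {B: {A, ...}}, meaning B can only start after every A finishes.
    Returns the earliest finish time of each object."""
    graph = {obj: set(depends_on.get(obj, ())) for obj in durations}
    finish = {}
    for obj in TopologicalSorter(graph).static_order():
        # An object starts once all objects it depends on have finished.
        start = max((finish[p] for p in graph[obj]), default=0)
        finish[obj] = start + durations[obj]
    return finish


# Hypothetical example: a Split object, two Function objects that are
# Parallel to each other, and a Merge object that depends on both.
durations = {"split": 2, "f1": 5, "f2": 7, "merge": 1}
depends_on = {"f1": {"split"}, "f2": {"split"}, "merge": {"f1", "f2"}}
finish = finish_times(durations, depends_on)
```

In this example the two Function objects overlap in time, so the Merge object finishes at 2 + 7 + 1 = 10 rather than at the serial total of 15.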
For a more compact representation, we define three kinds of code blocks: Pipeline block, Serial block and Parallel block. Based on the representations above, PCM algorithms can be easily built and stored as templates in a PCM algorithm library. Here we take one PCM algorithm as an example (see Figure 2).
In Figure 3, the SameTask algorithm describes that tasks (which are Function objects) with the same function are distributed to each PE, and the data is divided into blocks to be processed in parallel by the PEs. In the representation A~B, B is the name of function A. In lines 2-3, Os1 schedules a Function object to n computing elements, and in the meantime Od splits Dd into m*n smaller data objects. The objects in the "<>" after ":" are the operands of the object before ":", and what follows "→" or "⇒" represents the result of the operation or the destination of the schedule. PCMs and PCM algorithms can be written by the architecture designers. These representations are saved in a library that is used by the ABTE intelligent engine.
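The behaviour that SameTask describes can be imitated in ordinary Python. The sketch below is an assumption-laden illustration, not the notation of Figure 3: the names `split`, `same_task`, `num_pes` and `blocks_per_pe` are hypothetical, and threads merely stand in for processing elements.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split data into `parts` contiguous blocks of near-equal size."""
    size, extra = divmod(len(data), parts)
    blocks, start = [], 0
    for k in range(parts):
        end = start + size + (1 if k < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


def same_task(func, data, num_pes, blocks_per_pe):
    """Run the same function on every PE, each PE working on its own blocks."""
    blocks = split(data, num_pes * blocks_per_pe)

    def run_pe(pe):
        # PE number `pe` processes its own blocks one after another.
        own = blocks[pe * blocks_per_pe:(pe + 1) * blocks_per_pe]
        return [[func(x) for x in block] for block in own]

    with ThreadPoolExecutor(max_workers=num_pes) as pool:
        per_pe = list(pool.map(run_pe, range(num_pes)))
    # Merge the results back in the original order.
    return [y for pe_blocks in per_pe for block in pe_blocks for y in block]


result = same_task(lambda x: 2 * x + 1, list(range(100)), num_pes=9, blocks_per_pe=2)
```

Because the blocks are contiguous and are merged in PE order, the result has the same order as the input data.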
C. TriBA based parallel computing model
TriBA [3] is a multi-core architecture with a special triplet-based interconnection, different scales of which are shown in Figure 4.
Figure 4. (a), (b) Hierarchical interconnections of computing cores and hierarchical memory system in TriBA [3]
Every node in the figure is called a cell, whose typical configuration consists of three function units: a processing unit (ProcUnit), a data unit (DataUnit), and an interface unit (InterUnit). Each ProcUnit has private access to an L1 cache. Every three cells share access to an L2 cache. On a cache miss, each of the 9 cells can access an off-chip memory.
The course of configuring the 9-cell TriBA pattern in ABTE is described as follows:
The basic computing element of TriBA is the ProcUnit in a cell, so only one instance of ProcElement, with its parameters initialized as ProcUnit, is generated and stored in the processor pool. Similarly, three kinds of memory are used in TriBA: a 16KB on-chip private L1 cache, a 64KB on-chip group-shared L2 cache, and a 512MB off-chip memory. The L1 and L2 caches are built from a special 4-port SRAM, while the off-chip memory is a DRAM. So two kinds of objects, SByte_4P and DByte, are stored in the memory pool to build SRAM and DRAM respectively. The different MemoryBlock objects are initialized as in Table 2. The interconnection graph is initialized as in Figure 5.
In the TriBA ICG, different kinds of edges have different weights, which are approximated from the real architecture of TriBA. The PCM of TriBA is then configured and stored in the PCM library. Programmers can choose an existing model from this library.
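The memory part of this configuration can be sketched with a component pool as follows. This is a hedged illustration rather than the ABTE implementation: the class names `MemoryElement` and `MemoryBlock`, the `count` field and the single port assumed for DByte are hypothetical, and the block counts (9 L1 caches, 3 L2 caches, 1 off-chip memory) are read from the description of the 9-cell TriBA above.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class MemoryElement:
    """A component type kept once in the memory pool."""
    name: str
    ports: int


@dataclass(frozen=True)
class MemoryBlock:
    """A block refers to a pooled element instead of owning its own copies."""
    label: str
    element: MemoryElement
    size_bytes: int
    count: int


memory_pool = {
    "SByte_4P": MemoryElement("SByte_4P", ports=4),  # 4-port SRAM element
    "DByte": MemoryElement("DByte", ports=1),        # DRAM element; port count is an assumption
}

blocks = [
    MemoryBlock("L1", memory_pool["SByte_4P"], 16 * KB, count=9),   # private, one per cell
    MemoryBlock("L2", memory_pool["SByte_4P"], 64 * KB, count=3),   # shared by three cells
    MemoryBlock("OM", memory_pool["DByte"], 512 * MB, count=1),     # off-chip memory
]

# Only two element objects exist, however many blocks refer to them.
pooled_objects = len({id(b.element) for b in blocks})
on_chip_bytes = sum(b.size_bytes * b.count for b in blocks
                    if b.element.name == "SByte_4P")
```

Under these assumptions the on-chip capacity is 9 × 16KB + 3 × 64KB = 336KB, while the pool holds only the two element objects SByte_4P and DByte.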
III. PARALLEL SOLVING MODEL AND ABTE FRAMEWORK

A. Parallel solving model
Just as a PCM describes the parallel hardware, a Parallel Solving Model (PSM) captures the features of parallel software design. The algorithms based on the mechanism of a specific PSM are called PSM algorithms. Solving a parallel problem is in effect a course of mapping: from software to hardware, from PSM to PCM, and from PSM algorithm to PCM algorithm. Without a proper mapping, parallel applications will not run efficiently.
Two software models are widely used in parallel programming: message passing and shared memory. Stream processing is also a parallel model that plays an important part in image processing and scientific computing. We take these three models as the basic PSMs. Each of them achieves better performance on an architecture whose features suit the model.
The message passing model assumes that the underlying hardware is a collection of PEs, each with its own local memory, and an interconnection network supporting message passing between PEs. So the latency of communications plays a dominant role in static evaluation.
The underlying hardware of the shared memory model is assumed to be a collection of PEs, each with access to the same shared memory. PEs can interact and synchronize with each other through shared variables.
The stream processing model focuses on processing high-volume input data with multiple computational units. A set of data is called a stream, and the operations applied to each element in the stream are called kernel functions.
The underlying hardware of the stream processing model is a stream processor, which is equipped with a fast, efficient, proprietary bus or crossbar switches, and a large cache in which stream data is stored to be transferred to external memory in bulk.
PSM algorithms are what people usually call parallel algorithms. They may be based on one or more PSMs. For example, when solving the Gaussian elimination problem [4], we can use a row-oriented or a column-oriented algorithm, both of which are based on the message passing PSM.
PSMs and PSM algorithms are expressed from the software perspective, which is the common view programmers take of parallel programming. Hardware features are not easy for programmers to understand, let alone to account for while coding. The architecture-based trace and evaluation proposed here aims to help them figure out how a PSM would perform when combined with specific hardware, before they program on it.
B. ABTE Framework
The ABTE framework can stimulate the creativity of human designers and help them reach an optimized way to solve a given problem. The framework captures the features of solving parallel problems. It is organized as an object-oriented system, in which every part can be regarded as an object. Thus the framework is scalable and portable. Users can simply add or change function components to construct a more suitable framework. As shown in Figure 6, it is composed of the following parts:
1) Customize interface. This interface is offered to programmers who choose system-aided programming, and helps them choose the models and algorithms that fit their applications.
2) Model library and algorithm library. The model library keeps PCMs and PSMs, which are expressed in the corresponding model description patterns; the definitions and examples will be introduced in the next section. The libraries can be refreshed automatically as the system is run repeatedly.
3) Evaluation engine. It evaluates the input design and returns the evaluation result to programmers so that they can optimize their design. Depending on the design environment, two kinds of evaluation can be used. The static engine analyzes the source code in combination with the hardware/software models and algorithms chosen by programmers. The key part of the static engine is an object-extract analyzer, which is explained in the following part of the paper. The dynamic engine runs the key part of the code, which can be specified by programmers, on the required hardware architecture to obtain more accurate evaluations. However, the dynamic engine can only evaluate applications based on PCMs that match the hardware architecture on which the ABTE framework is running. The selected key parts of an application may not reveal its true performance, but running the whole application on a huge problem set may cost too much time and computing resources. The static way makes it easier to get a preliminary picture of a design's performance on several sets of hardware, so programmers can choose a comparatively better method to implement their applications without too much effort.
4) Programming interface. There are two ways for programmers to design on the ABTE framework. The first is to write code based on the models and algorithms offered by the framework. The second is to input program components directly to the evaluation engine and find out whether the program is suitable for running on a specific hardware architecture. The programming interface is used in both ways. Given a configuration file for the application, the interface helps programmers insert the necessary marks or control primitives into the program.
5) Parallel compilers and runtime system. These two parts are necessary for mapping programmed applications to hardware architectures. The parallel compilers, which are reconfigured from existing compilers, help translate programs into executable code based on the models and algorithms that have been chosen. The actual mapping operations are realized in the runtime system.
Figure 6. Organization of ABTE
IV. MARKED OBJECT TRACE
Applications that need to be solved in parallel are usually composed of a number of tasks or require dealing with a huge set of data. Most of them actually repeat a series of operations many times. If the latency of one iteration can be computed, the performance of the whole application is easy to estimate.
Solving a problem with a computer in fact consists of two parts: transferring and computing. When a program runs, data must be transferred from one memory unit to another. Even when an instruction executes, data still needs to be transferred from one register to another. The "computing" time may be only a small part of the total latency of the program; most of the running time is spent on transferring. Based on this observation, we propose a method that traces the latency of transferring to evaluate the performance of a parallel program.
In the field of molecular biology, luminescent protein is widely used for observing the intracellular localization of proteins. In our method, we likewise use "luminescent objects" to observe and trace the running of a parallel program. Imagine that a set of objects is marked green and takes part in the running program. During the lifetime of the objects, the transfer paths they pass along and the computing elements they appear in are all colored green. The green trace is then a record of one iteration. We can also use objects of multiple colors to mark the multiple kinds of function components that we want to study. By tracing these marks, we can estimate the performance of the whole application. The course of ABTE is described as follows, in which (P) and (E) mean that the step is completed by programmers or by the evaluation engine, respectively:
1) Choosing the PSM, PCM and corresponding algorithms, and making the necessary changes and settings to form a solution of the application.
Here we explain the ABTE method by using TriBA to solve a simple parallel application. The problem is to perform the same arithmetic operations on a set of data. For each number A in a set with 90000 members, we perform the following arithmetic operations and save the result R back:
In the first step, the programmer chooses the shared memory PSM and a PSM algorithm that generates multiple identical tasks to compute concurrently. Naturally, the TriBA PCM and the SameTask PCM algorithm defined in Figure 3 are chosen. The parameters m and n in SameTask are set to 9 and 10000 respectively. In the second step, the object Dd1 is marked. From the third step on, the evaluation engine takes over. It maps the computing task into 9 Of objects, and the shared memory is mapped to the hierarchical memory system of TriBA. The fourth step is to trace the path of the marked object Dd1: Dd1, stored in off-chip memory, is transferred into an L2 cache and then into an L1 cache. The processor accesses the L1 cache to fetch Dd1, computes, and then sends the result Dd1' back. When all the numbers have been computed, the L1 caches transfer them to L2, and L2 transfers them to off-chip memory. Dd1' is transformed from Dd1, so Dd1' also carries the mark (color). There are also periods when Dd1 is not active, such as when it stays in memory waiting for others to be transferred. This static time is marked in the trace to be processed in the next step. When computing the total latency of the application, the engine evaluates the parallelism of Dd1-like objects.
Figure 7(b) shows their traces. The special four-port memory structure in TriBA can support four concurrent transfers. According to the PCM algorithm and the parameters of the TriBA architecture, we can compute the latency of each phase using the formula, in which BW(M) is the bandwidth of M, D is the data block being transferred, and w is the weight defined in the ICG:
As the bandwidths of each port of OM, L2 and L1 are 8, 32 and 32 respectively, we find that 23 other traces are exactly like that of Dd1. Each of them is followed by 3749 objects in the OM-L2 phase. After the synchronization in L2, 95 other objects are transferred like Dd1, each followed by about 312 objects in the L2-L1 phase. In the L1-PU phase, an object does not need to wait for others; the L1-PU, PU computing and PU-L1 phases form a 3-stage pipeline, whose latency is determined by the longest stage. So the total latency of this solution is computed as 3750*w8 + 313*w7 + 2*w4 + T*10000.
The result is returned to the programmer, who can choose to switch to another solution or to set the model in more detail to obtain more accurate results.
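The counts in the latency expression can be rechecked with a few lines of Python. This is only a sketch: the function name `total_latency` and its parameter names are hypothetical, no real values for w8, w7, w4 or T are assumed, and the reading that the 96 L2-L1 traces are counted per L2 group (three cells per L2 cache, so three groups) is an interpretation that happens to reproduce the figure of about 312 followers.

```python
import math

TOTAL_OBJECTS = 90000
OM_L2_TRACES = 24        # Dd1 plus the 23 traces exactly like it
L2_GROUPS = 3            # assumption: every three of the 9 cells share one L2
L2_L1_TRACES = 96        # Dd1 plus 95 others, assumed to be per L2 group
OBJECTS_PER_PE = 10000   # 90000 objects spread over 9 PEs

# Objects handled by one OM-L2 trace: the leader plus 3749 followers.
per_om_trace = TOTAL_OBJECTS // OM_L2_TRACES
# Objects handled by one L2-L1 trace: the leader plus about 312 followers.
per_l2_trace = math.ceil(TOTAL_OBJECTS / L2_GROUPS / L2_L1_TRACES)


def total_latency(w8, w7, w4, t_stage):
    """Total latency expression from the text, with the edge weights and the
    per-object pipeline time T supplied by the caller."""
    return (per_om_trace * w8 + per_l2_trace * w7
            + 2 * w4 + t_stage * OBJECTS_PER_PE)
```

Calling `total_latency` with weights taken from a configured ICG would give the estimate for one solution; changing the weights or the trace counts shows how sensitive the estimate is to each phase.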
V. EVALUATION
In this section, we again take the application in Section 4 as an example to evaluate ABTE in two respects: usability and accuracy. We have already used ABTE to evaluate the solution with the shared memory PSM and the TriBA PCM. Now we consider the evaluation of other solutions and compare them to find a better one.
We configure four solutions, which are the different combinations of two PCMs and two PSMs. The PCMs are the 9-core TriBA and a 9-core CMP with a 2D-mesh on-chip network, in which every core has a 16KB L1 cache and all cores share a distributed L2 cache (192KB in total) and a 512MB off-chip memory. These two PCMs have seemingly similar computing ability. The PSMs we choose are message passing and shared memory. The message passing algorithm assigns one kind of operation (addition, subtraction or multiplication) to one computing element, so computing one R takes 3 cores. The result computed by one core is transferred to another core as a message for the next computation. The operations of the three cores can form a pipeline.
In Table 3, the main transfer latencies of the two PCMs are listed. OM means the off-chip memory, and C-C means the latency of communication between two directly connected cores. The computing time of multiplication is 6, while that of addition and subtraction is 3.
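The effect of pipelining the three operations can be illustrated with a compute-only estimate. The sketch below rests on several assumptions that are not in the text: the function name `pipeline_latency` is hypothetical, transfer latencies such as C-C are ignored, and the 9 cores are assumed to form 3 independent pipelines that each handle 30000 of the 90000 numbers. The stage times 3, 3 and 6 are the computing times given above.

```python
def pipeline_latency(stage_times, items):
    """Latency of a linear pipeline with constant stage times: the first item
    passes through every stage, then one more item completes per interval of
    the slowest stage."""
    if items == 0:
        return 0
    return sum(stage_times) + (items - 1) * max(stage_times)


COMPUTE_TIME = {"addition": 3, "subtraction": 3, "multiplication": 6}
ITEMS_PER_PIPELINE = 90000 // 3   # assumption: 3 pipelines of 3 cores each

stages = list(COMPUTE_TIME.values())
pipelined = pipeline_latency(stages, ITEMS_PER_PIPELINE)
unpipelined = sum(stages) * ITEMS_PER_PIPELINE
```

Under these assumptions the multiplication stage is the bottleneck, so the pipelined compute time is roughly half of the unpipelined one; the transfer latencies of the chosen PCM would have to be added to each stage for a full estimate.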
Figure 8. Latency of four solutions by ABTE (vertical axis: latency in millions; series: total latency and transfer latency)
Figure 8 shows the total latency and the transfer latency of the four solutions. SM and MP are short for shared memory and message passing. From these results, programmers can see that the better solution for this application is the TriBA PCM combined with the shared memory PSM. Except for the TriBA+SM solution, the transfer latency is almost equal to the total latency. That is because transferring occupies most of the running time, and some of the transfer time overlaps with the computing time. The solutions with the 2D-mesh PCM need more time than those with the TriBA PCM. That is because TriBA contains multi-port memories, which allow high data throughput. A better solution for the application would feature high-speed transfer and memory, which are exactly the features of stream processors. As ABTE is based on models, it can evaluate a hypothetical solution even when the hardware architecture is not available. We assume a stream processor organized like the Imagine stream processor in [1], changing only the number of ALU clusters to 9 and making the size of the memory system similar to that of TriBA. The PSM chosen is the stream model, in which the three kinds of operations are configured as 3 kernels mapped onto the clusters. The latency of this solution is shown in Figure 9, and it is much lower than that of the others. Although the errors between the two groups of results are not small, the trend as to which solution is better is accurate, so the purpose of the framework is satisfied. Programmers can choose the better solution for an application based on the ABTE results.
VI. CONCLUSIONS AND FUTURE WORK
The ABTE method helps evaluate a solution to a parallel problem before actually running it. By comparing the performance of various solutions, programmers can find the better one to implement without wasting time on solutions that cannot meet the requirements.
This method is supported by model descriptions of various architectures and algorithms, an evaluation engine, and an interface for programmers. So far we have defined PCMs and PCM algorithms to describe several architectures and their running modes. With the help of these models and algorithms, we have implemented the marked object trace method in the evaluation engine to evaluate solutions selected by programmers.
The user interface is not yet very convenient: solutions must be chosen and set by changing options in the configuration file. In future work, we will build a graphical interface that lets programmers express their designs simply by dragging a few graphical elements and entering some numbers. As the evaluation engine currently focuses only on the latency of a solution, in the next step we will extract other features, such as memory usage and interconnection occupancy, from the traces we obtain.
