3D graphics applications are widely used in consumer electronics, and this trend is set to continue. A complex system such as a 3D graphics SoC is generally modeled at a higher abstraction level. The open question is how to traverse the design space hierarchically, reduce simulation time, and refine performance quickly. This paper presents a system-level design space exploration model for refining a tile-based 3D graphics SoC. The model uses UML tools to help designers traverse the whole system, and it adopts SystemC to reduce simulation time dramatically. As a result, system performance improves by 198% for the geometry function and by 69% for the rendering function.
Introduction
In recent years, 3D graphics applications have become increasingly popular in consumer electronics, particularly in the market for 3D gaming on mobile devices [1]. Such consumer electronics differ considerably from the workstations and desktop PCs for which traditional 3D graphics applications were first developed. Their distinguishing features include a smaller screen, a lower resolution, a shorter expected distance between the screen and the user's eyes, and a lower power budget to conserve battery life.
Since 2005, we have cooperated with Himax Technologies Inc. [2] to develop a low-cost tile-based 3D graphics SoC, which consists of a 3D graphics engine and an embedded debugging/performance monitoring engine (EDPME) [3]-[7], as shown in Fig. 1. The 3D graphics engine, comprising a Geometry Engine (GE) and a Rendering Engine (RE), balances functionality, performance, and silicon area for consumer electronics with a medium-sized screen (≤ 640 × 480 pixels) and a moderate frame rate (≤ 30 fps). This earlier work supports OpenGL ES 1.0 and achieves 8.69 Mvertices/s, 278 Mpixels/s, and 278 Mtexels/s at 139 MHz. The 3D graphics engine occupies fewer than 700K gates, which makes it small enough to integrate into an SoC for consumer electronics. We therefore have a tile-based 3D graphics SoC prototype that includes a 3D graphics hardware accelerator and 3D graphics application software.
The problem now is how to keep improving the performance and refining the architecture of this 3D graphics SoC quickly. System-level design space exploration is becoming an inevitable solution to this problem in complex SoC design. In our approach, we model the 3D graphics accelerator in SystemC with cycle accuracy, following the original RTL. We then place the SystemC model in CoWare's Platform Architect environment [8] to simulate the behavior of the whole 3D graphics SoC.
The remaining question is how to carry out the design space exploration progressively, efficiently, and simultaneously. Because we focus on the interactions between hardware and software, we also adopt the Unified Modeling Language (UML) and use UML diagrams [9] to describe the 3D graphics application. In these diagrams, the objects represent the hardware and the process flow represents the software. We can therefore observe the process, organization, and dependences of the 3D graphics application through these diagrams, which help designers view the whole system. We then traverse the design space with an incremental refinement model.
The proposed design space exploration approach features the following key capabilities: 1) object-oriented analysis of the system architecture (using a UML model to avoid conflicts between different IPs and interfaces during refinement); 2) interface-based design [12].
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the proposed 3D graphics application, including the SystemC hardware components and the C software. Section 4 uses UML tools to model the interaction between hardware and software; these diagrams let us traverse the whole design space, and an incremental refinement model improves performance at both the flat and the hierarchical level. Section 5 describes the refinement procedures and experimental results, covering intra-IP, the interface between IP and bus, inter-IP for the specific function, and inter-IP for the whole system. Finally, Sect. 6 concludes the paper.
Related Works
System-level design has attracted growing attention from both industry and academic research teams. System-level design methodology is widely accepted as a way to cope with growing complexity: it raises the abstraction level of the primary specification in order to explore architecture trade-offs and guide HW/SW partitioning decisions [11]-[27].
Schneider [13] discussed hardware architecture trade-offs at different abstraction levels. A hardware architecture trade-off consists of architecture exploration and architecture selection. During exploration, top-down information is collected at the system level. For selection, bottom-up information is needed to estimate the hardware costs and to evaluate each architecture alternative. Exploration and selection form an iterative process in which both top-down and bottom-up information are needed at every step.
Kim et al. [17] extended the general AND-OR graph to the Attributed AND-OR Graph (AAOG) to support design space exploration of digital systems, with the aim of finding the best configuration that satisfies the design requirements.
Kogel et al. [14] first proposed SystemC-based system-level architecture exploration for a 3D graphics processor. Their case study defines a feasible system architecture that copes with 3D graphics processing and its internal memory bandwidth requirements. Crisu et al. [15], [16] presented GRAAL, a design exploration framework for embedded 3D graphics accelerators. GRAAL is an open system that offers a coherent development methodology based on an extensive library of SystemC RTL models of graphics pipeline components. However, these works do not discuss the interrelation with external memory and interfaces.
Fujitsu Limited [23] developed a new SoC design methodology that employs UML and the C/C++/SystemC programming languages. The methodology differs significantly from the conventional system LSI design methodology in two respects: 1) it uses UML to describe the specification; 2) it introduces UML and C++/SystemC in the phases that follow the partitioning of the system into hardware and software components. For heterogeneous IP integration, Sun and Wong [25] proposed an interface synthesis method that uses UML notation to model the interfaces of predefined components, and they built a code generator that produces the interface adapters from the UML models. Riccobene et al. [26] provided a UML 2.0 profile for high-level SoC modeling from which SystemC code is generated automatically. Their SoC design flow allows a further validation step, involving both the hardware and the software parts, at the transactional level: the software part can be simulated by performing transactions that carry the accurate timing information defined by the transactional-level hardware architecture model (Transactional Co-simulation in Fig. 2).
Both [23] and [26] use UML stereotypes for system-level modeling, but neither appears to embed them in a suitable refinement flow or to apply them in a real application. We therefore extend existing methods to build an incremental refinement model and explore the refinement possibilities of four aspects (intra-IP, interface, inter-IP for the specific function, and inter-IP for the whole system) for our tile-based 3D graphics SoC.
Organization of Proposed 3D Graphics SoC
In general, 3D graphics applications are computationally complex and need a 3D graphics accelerator to speed up 3D graphics data processing enough to meet real-time requirements. Basically, the 3D graphics SoC consists of a general-purpose processor (such as the ARM926EJ-S), a 3D graphics accelerator, a DMA controller, a display engine (DE), a system bus (such as AMBA or AXI), and an external memory (such as SDRAM).
The 3D graphics accelerator handles the geometry and rendering modules in order to accelerate 3D graphics data processing. The processed results are displayed through the DE. Finally, the DMA can be used to initialize memory more efficiently.
Data Flow of 3D Graphics Application
Figure 3 illustrates the basic flow of the 3D graphics application. The 3D graphics software translates the 3D graphics data from object data into vertex data, which are the input of the geometry function. Before the software starts the rendering function, it requests the DMA to initialize the frame buffer and the Z-buffer so that the previous frame does not pollute the current one. When the geometry function finishes its task, it notifies the rendering function to continue. Until the current frame is complete, the geometry function and the rendering function iteratively render objects into the frame buffer, which holds the output of the rendering function.
SystemC-Based Platform Environment
Figure 4 shows the overall system architecture of the 3D graphics application, built in SystemC. Except for the general-purpose processor (an ARM926EJ-S), which is provided by Platform Architect [8], all components of the system were built in this work. The ARM processor runs the 3D graphics application software and triggers the 3D graphics accelerator and the other peripherals. The DE has a master wrapper through which it frequently reads the frame buffer and displays the frame. The memory module is an SDRAM controller. The interrupt controller receives signals from the DMA, the GE, and the RE, and notifies the ARM processor to handle them. The DMA has a slave (S) wrapper for setting up its internal status and two functionally identical master (M) wrappers for accessing memory. The ARM processor controls the GE by writing context table data to the GE's slave wrapper. The GE uses M1 to read vertex data and M2 to produce tile-list data in SDRAM. The RE also receives its configuration data through a slave wrapper, and it uses its master wrapper (M) to access pixel data in the frame buffer and the Z-buffer.
Geometry Engine
The block diagram of the GE is shown in Fig. 5. The GE comprises two main functional modules: a geometry module (GM) and a tile divider module (TDM). The GE reads its input from the vertex data buffer (VDB) and writes its output to the tile list buffer (TLB) in SDRAM. For external connection, the GE provides a slave wrapper, a GM master wrapper, a tile divider (TD) master wrapper, and an IRQ signal. The ARM processor sets up the geometry operation parameters, called the context table, through the GE's slave wrapper. The GM master wrapper supplies the GM with data read from the VDB, and the TD master wrapper is dedicated to accessing the TLB. When the GE finishes its assigned task, it notifies the ARM processor through the IRQ signal.
The GM is responsible for the 3D geometry operations: transformation and lighting (T&L), culling, and clipping. Its output is passed to the TDM, which implements the tile-based concept to reduce the number of memory accesses and builds the tiled triangle list in the TLB for the RE. The GM has three pipeline stages, each of which takes 16 cycles. Because the GM and the TDM need different amounts of time to process data, a FIFO buffer is placed between them to hold the temporary data produced by the GM. If the FIFO buffer is full, the GM stalls.
Rendering Engine
Figure 6 shows the block diagram of the RE. The RE adopts a tile-based approach [28] to reduce memory bandwidth. We employ a tile size of 32 × 32 pixels, which generally yields the best trade-off between the amount of on-chip memory and the amount of external data traffic. The RE handles three types of data: the TLB, the frame buffer, and the Z/Stencil buffer. The ARM processor configures the RE's context table through the slave wrapper. The RE reads data from the TLB and writes the results into the frame buffer and the Z/Stencil buffer.
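The stall behavior of the FIFO between the GM and the TDM in the Geometry Engine can be illustrated with a small cycle-stepped sketch. It is not the SystemC model used in this work: the function name `simulate`, the FIFO depth, and the TDM processing time are hypothetical, and the GM is assumed to deliver one item every 16 cycles once its pipeline is full.

```python
from collections import deque

def simulate(num_items, gm_period, tdm_period, fifo_depth):
    """Cycle-stepped GM -> FIFO -> TDM chain.

    Returns (total_cycles, gm_stall_cycles). All parameters are
    illustrative assumptions, not values taken from the RTL.
    """
    fifo = deque()
    produced = consumed = 0
    gm_timer = gm_period   # cycles until the GM has its next item ready
    tdm_timer = 0          # cycles until the TDM finishes its current item
    tdm_busy = False
    stalls = 0
    cycle = 0
    while consumed < num_items:
        cycle += 1
        # TDM side: finish the current item, then fetch the next one.
        if tdm_busy:
            tdm_timer -= 1
            if tdm_timer == 0:
                tdm_busy = False
                consumed += 1
        if not tdm_busy and fifo:
            fifo.popleft()
            tdm_busy = True
            tdm_timer = tdm_period
        # GM side: push a finished item, or stall while the FIFO is full.
        if produced < num_items:
            if gm_timer > 0:
                gm_timer -= 1
            if gm_timer == 0:
                if len(fifo) < fifo_depth:
                    fifo.append(produced)
                    produced += 1
                    gm_timer = gm_period
                else:
                    stalls += 1
    return cycle, stalls

# TDM as fast as the GM: the FIFO never fills, so the GM never stalls.
balanced_cycles, balanced_stalls = simulate(100, 16, 16, 4)
# TDM slower than the GM (assumed 40 cycles per item): the GM stalls.
slow_cycles, slow_stalls = simulate(100, 16, 40, 4)
```

In this sketch, a deeper FIFO only delays the first stall; when the TDM is slower on average, the sustained throughput is still set by the TDM.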
Incremental Refinement Model
The Refinement Process
The proposed 3D graphics SoC uses SystemC as the description language for each model, so function and performance can be confirmed through simulation-based verification in CoWare's Platform Architect development environment. Moreover, we employ UML, as shown in Fig. 7, as a specification language before implementing each model in SystemC. Using UML is expected to bring two benefits:
1) Clarification of specification: because UML models the design specification with graphical diagrams, we can confirm the correctness of the design before coding.
2) Language-independent design: because UML does not depend on any implementation language, a UML model can be implemented in any of them (such as an HDL, C/C++, or SystemC).
Three models are used to analyze the original version of the 3D graphics SoC shown in Fig. 4: the functional model, the architectural model, and the performance model. They are introduced below.
Functional Model
The functional model focuses on the structure of the function and does not consider any physical architecture or timing. In the functional model, we expose task-level behavior and make communication explicit. The functional model comprises processes and the communications between them. Two measures quantify the workload in the functional model: 1) computation workload, and 2) communication workload, which is used to calculate the cycle time of access requests, including the waiting and active states.
Using the statistics of the computation and communication workloads, the bottleneck of the selected architecture can be found easily, which helps improve the system until it meets the performance requirements of the design. Figure 8 shows the communication channels among processes in the functional model.
Architectural Model
Each resource in the architectural model is parameterized with its characteristics. For example, a processor deploys scratchpad memory. In Fig. 9, the GE owns two ports, gmaster and gslave, and deploys a sendAddr() process for issuing read requests, and so on.
Performance Model
The performance model explicitly maps the processes of the functional model onto the processing resources of the architectural model, and it assembles communication channels from communication resources. In the performance model, system-level trade-offs are evaluated by hardware/software co-simulation with an ISS. The timing information can be derived from a lower-level model of the processing resource.
Functionality, structure, and timing must be considered together in the performance model, so it can be applied to both functional verification and performance refinement. After the performance model, the partition into hardware and software is fixed, and each part can be implemented separately. We use a UML sequence diagram to give an overview of the interaction and flow of the processing resources. Figure 10 gives the performance evaluation of the 3D graphics SoC, which is very helpful for performance analysis.
Aspects of Design Space Exploration
A development flow is also needed to carry out refinement efficiently. Because a 3D graphics system is very complex, we use an incremental refinement approach to observe the relationships among IPs. To make refinement efficient and easy, we divide it into four aspects (intra-IP, interface between IP and bus, inter-IP for the specific function, and inter-IP for the whole system), as shown in Fig. 11 and Table 1. For example, intra-IP refinement improves performance by modifying parameters such as the cache. Refinement stages that are independent of each other can also be carried out in parallel and hierarchically.
Design Space Exploration for Proposed 3D Graphics SoC Refinement
This section shows how to improve system performance step by step, following the approach described above. Table 1 shows the design space exploration that uses UML for system refinement. The system environment factors that affect performance include memory data allocation, transfer mode, and system bus topology. The 3D graphics benchmark is an elephant model rendered at 640 × 480 pixels; it has 87,840 vertices and, with 32 × 32-pixel tiles, occupies 28 tiles.
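The tile arithmetic behind the benchmark setup can be sketched as follows. The helper names `tile_grid` and `tiles_touched` are illustrative, and the bounding box is hypothetical: the paper states only that the elephant occupies 28 tiles, not where it sits on the screen.

```python
import math

def tile_grid(width, height, tile):
    """Return (columns, rows, total) of square tiles covering the screen."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return cols, rows, cols * rows

def tiles_touched(bbox, tile):
    """Count tiles overlapped by a bounding box (x0, y0, x1, y1), inclusive pixels."""
    x0, y0, x1, y1 = bbox
    return (x1 // tile - x0 // tile + 1) * (y1 // tile - y0 // tile + 1)

# 640 x 480 screen with 32 x 32 tiles, as in the benchmark.
cols, rows, total = tile_grid(640, 480, 32)

# Hypothetical tile-aligned box of 7 x 4 tiles; one of many footprints
# that would cover 28 tiles.
covered = tiles_touched((192, 128, 415, 255), 32)
```

Under these assumptions the screen holds 20 × 15 = 300 tiles, so an object covering 28 tiles touches under a tenth of the tile grid.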
Intra-IP: Four-Bank SDRAM
From Fig. 10 , the 3D graphics SoC accesses image data frequently and sequentially, and these transactions cause SDRAM to occur the status of the row-hit. If we apply banking technique, the access latency would be reduced efficiently. Also, the communication time of the GM, the TDM, and the RM will be decreased. Hence, for intra-IP refinement process, firstly, we add an attribute bank active in SDRAM class diagram. Secondly, we modify memory allocation in software and Cycle count() function of SDRAM in SystemC for banking latency. According to Fig. 9 , we partition memory into four parts: program code & data (PC), vertex buffer (V), tile list (TL) buffer, and Frame/Z Buffer (F) as shown in Fig. 12 . The PC part stores 3D graphics application software including code and 3D graphics object data. The V part stores vertex data that outputs from 3D graphics software and input to the GE. The TL produced by TDM in the GE. The RE and the DE access the F part to render and display.
In general, we put all data in sequential memory address like Fig. 12 (a) . We consider SDRAM access mechanism and reduce row-miss ratio to get smaller latency penalty. We put data of different types into different banks like Fig. 12(b) in SDRAM. The GM and the TD wrappers with the allocation of four banks have 48% and 34% performance improvement, respectively, because they continue to access V and TL buffers. The RE wrapper accesses frame buffer randomly and shares TL buffer with TDM, therefore it cannot increase row-hit through memory bank policy.
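The effect of the bank allocation on the row-hit ratio can be illustrated with a toy open-row model. This is a sketch under stated assumptions, not the SDRAM model of this work: the row size `ROW_WORDS`, the bank-local addressing, and the strictly alternating GM/TDM access pattern are all hypothetical.

```python
ROW_WORDS = 256  # assumed number of words per SDRAM row (hypothetical)

def row_hit_ratio(accesses):
    """accesses: iterable of (bank, word_address). Returns the row-hit fraction."""
    open_row = {}   # bank -> row currently open in that bank
    hits = 0
    total = 0
    for bank, addr in accesses:
        row = addr // ROW_WORDS
        if open_row.get(bank) == row:
            hits += 1
        else:
            open_row[bank] = row  # row miss: the bank must open the new row
        total += 1
    return hits / total

def interleave(v_bank, tl_bank, n, v_base, tl_base):
    """Alternate a GM read of vertex word i with a TDM write of tile-list word i."""
    for i in range(n):
        yield (v_bank, v_base + i)
        yield (tl_bank, tl_base + i)

n = 1024
# Sequential layout: V and TL share one bank, so each access closes the
# row that the other stream just opened.
single = row_hit_ratio(interleave(0, 0, n, v_base=0, tl_base=64 * ROW_WORDS))
# Banked layout: V and TL live in different banks, so each bank keeps its
# own row open between accesses.
banked = row_hit_ratio(interleave(0, 1, n, v_base=0, tl_base=0))
```

In this idealized pattern the shared-bank layout never hits an open row, while the banked layout misses only when a stream crosses a row boundary. Real traffic is less regular, so the measured gains are smaller.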
Interface between IP and Bus: Transfer Mode
According to Fig. 10 , we can know that the GM, the TDM, and the RE access memory sequentially. Hence, we can change the bus transfer mode for performance improvement. In Fig. 9 , we can understand that the relation of communication between all IPs via bus. For interface refinement process, we need to modify SendAddr() of the master ports of the GM, the TDM, and the RE for external access process.
In real case, the AMBA 2.0 protocol supports single and burst transfer modes. In burst transfer mode, if the master with lower priority is transferring data, the master with higher priority waits until this transaction completed. In single transfer mode, the master with higher priority always is granted the bus whether if the master with lower priority is transferring data. Then, we consider the access latency of SDRAM. We assume that CAS latency needs 3 cycles, and row hit in SDRAM access. Figure 13 shows an example that explains the effect of different transfer modes. There are two master IPs, they are M1 and M2, respectively, and one slave IP on the AMBA AHB 2.0 bus. M1 requests a one-beat read transfer per two cycles, M2 requests four-beat read transfer per four cycles, and M1 has higher bus priority than M2.
According to Fig. 13 , M1 needs 12 cycles to finish two data transactions, and M2 needs 32 cycles to finish one data transaction in single transfer mode. In burst transfer mode, however, M1 needs 14 cycles to finish two data transactions, and M2 just needs ten cycles to finish one data transaction. M2, which transfers more data and has lower bus priority, gets higher performance in burst transfer mode. We modify the GE and the RE wrappers as burst transfer mode, to execute 3D graphics application benchmark, and gather the statistics information. In the field, geometry function includes the sequential operations of GM wrapper, GM operation and the TD wrapper. Because the sub-operation executes in parallel, the cycles of Geometry Function is less than the amount of the sub-operation cycles. Render Function includes RE Wrapper and RE operation. GM master wrapper improves about 66% performance of data access. GM idle time is increased because of the increase of GE data input and FIFO needs more time to wait the TDM to process data as shown in Fig. 5 . The TD wrapper and the RE wrapper improve 35% and 154% performance, respectively. Geometry function means the whole function of the GE. Because the GE includes some function blocks executing in parallel, the cycle time of geometry function is not equal to sum of cycle time of the GM wrapper, the GM operation, and the TD wrapper. Rendering function is the same as geometry function. According to Fig. 9 and Fig. 10 , we can observe that the GM and the TDM access VDB and TLB simultaneously. 3D graphics application usually accesses huge data, so to reduce memory access latency will improve significantly system performance. In this stage, we have proposed two methods to reduce memory latency. First, we increase the row-hit ratio in outside SDRAM. The second, we adopt scratchpad memory to store frequent access data. The TDM needs huge memory read/write to create the structure data of TLB. 
The TLB has two blocks: the RootZone and the DataZone. The RootZone stores the heads of all linked lists, and the DataZone stores the nodes of every linked list. The RootZone is small, because its size depends only on the screen and tile sizes, yet it is accessed very often. In this paper the RootZone occupies at most 300 words ((640×480)/(32×32) = 300). We therefore put the RootZone into SRAM to reduce memory access latency, and we propose two methods to do so. Method I puts the RootZone into scratchpad memory, and Method II puts it into the TD master wrapper as internal memory. When the TDM finishes creating the tile list, the system writes the RootZone back to the TLB. Because Method II uses internal memory, it avoids accessing the RootZone in external memory. With Method II the TD runs faster, which reduces the GM idle time and shortens the interval between the TD wrapper's bus requests. As a result, the GM wrapper contends with the TD wrapper for the bus, so the GM wrapper takes more cycles under Method II than under Method I. Overall, however, Method II is still faster than Method I.
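The RootZone/DataZone organization described above can be sketched as an array of list heads plus a pool of linked nodes. The class below is only an illustration: one list per tile, the `(triangle_id, next_index)` node layout, and the head-insertion policy are assumptions, since the paper does not give the exact TLB format.

```python
NIL = -1  # marks an empty list or the end of a list

class TileListBuffer:
    """Toy tile list buffer: RootZone holds list heads, DataZone holds nodes."""

    def __init__(self, num_tiles):
        self.root_zone = [NIL] * num_tiles  # assumed: one head word per tile
        self.data_zone = []                 # nodes stored as (triangle_id, next_index)
        self.root_accesses = 0              # RootZone reads and writes so far

    def append(self, tile, triangle_id):
        """Link a new node in front of the tile's current head."""
        head = self.root_zone[tile]
        self.root_accesses += 1             # read the head
        self.data_zone.append((triangle_id, head))
        self.root_zone[tile] = len(self.data_zone) - 1
        self.root_accesses += 1             # write the new head

    def triangles(self, tile):
        """Return the tile's triangle ids in submission order."""
        out = []
        node = self.root_zone[tile]
        while node != NIL:
            tri, node = self.data_zone[node]
            out.append(tri)
        out.reverse()  # nodes are linked newest-first
        return out

tlb = TileListBuffer(300)   # (640 * 480) / (32 * 32) tiles
tlb.append(5, 10)           # triangle 10 overlaps tile 5
tlb.append(5, 11)
tlb.append(7, 10)           # triangle 10 also overlaps tile 7
```

Every insertion reads and writes one RootZone word, whatever the list length. This is why a structure of about 300 words can dominate the TDM's memory traffic, and why moving it into on-chip SRAM pays off.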
Inter-IP for the Whole System
In this stage, we refine the performance of the whole system, considering both hardware and software. On the hardware side, we restructure the architecture to reduce bus contention: we use more buses to share the data traffic and look for the best number of buses and the best memory block allocation, as shown in Fig. 14 and Fig. 15. On the software side, the execution order matters for resource usage. We analyze the execution sequence of the software benchmark and find that the parallelism of the 3D graphics application can be improved by pipelining frames.
Because a 3D application needs a large amount of data access, we adopt a multi-layer bus architecture to avoid bus contention and to share the bus traffic. According to the pipeline stages and the attributes of the data, we put the ARM processor and the GM wrapper on AHB I and the TD and RE wrappers on AHB II, as in Fig. 14. 'I' and 'D' denote the I-cache and D-cache, respectively, and 'S' and 'M' denote slave and master wrappers, respectively. If an IP has two master wrappers, we call them 'M1' and 'M2'. On AHB II, the TD and RE wrappers both access the TLB, while the DMA, the DE, and the RE all access the same memory block, the F/Z buffer. We therefore partition the memory on AHB II into two blocks: the TLB and the F/Z buffer. The multi-layer bus architecture of Fig. 15 avoids contention when accesses to different blocks occur at the same time. It includes four masters and two slaves and supports at most two simultaneous transactions.
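The difference between a shared bus and the multi-layer interconnect on AHB II can be sketched with a one-cycle arbitration model. The fixed-priority order and the master names used as dictionary keys are assumptions for illustration; the paper does not specify the arbitration policy.

```python
def grant_shared(requests, priority):
    """Single shared bus: at most one master is granted per cycle."""
    for master in priority:
        if master in requests:
            return {master: requests[master]}
    return {}

def grant_multilayer(requests, priority):
    """Interconnect matrix: at most one master per slave per cycle."""
    granted = {}
    busy_slaves = set()
    for master in priority:
        slave = requests.get(master)
        if slave is not None and slave not in busy_slaves:
            granted[master] = slave
            busy_slaves.add(slave)
    return granted

# Hypothetical cycle: the TD wrapper wants the TLB while the DE and the
# DMA both want the F/Z buffer.
requests = {"TD": "TLB", "DE": "FZ", "DMA": "FZ"}
priority = ["DE", "RE", "TD", "DMA"]  # assumed fixed-priority order

shared = grant_shared(requests, priority)
multi = grant_multilayer(requests, priority)
```

With two slaves, the matrix grants at most two masters per cycle, which matches the limit of two simultaneous transactions stated above. Masters that target the same memory block still serialize.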
With the two-layer AHB bus, the GM and TD wrappers improve substantially, by 47% and 57%, respectively. With the multi-layer bus, the TD wrapper achieves a 111% speedup, but the absolute number of cycles involved is small, and the performance of the whole geometry function is similar to that of the two-layer AHB architecture. The multi-layer bus architecture also has an additional area cost: an interconnect matrix and two SDRAM controllers. The experimental results of the incremental refinements are shown in Fig. 16. Overall, performance improves by 198% for the geometry function and by 69% for the rendering function.
Frame Pipeline by Using Software
Table 2 shows the use case description, which helps us identify the data usage relationships of each component and the constraints on software refinement. We therefore use two TLBs (TLB1 and TLB2): the GE dumps data to TLB1 in stage 1; in stage 2, the RE reads data from TLB1 while the GE dumps data to TLB2. From Fig. 10, the rendering function takes 2,922,827 cycles and the geometry function takes 4,674,031 cycles, so we run the clear-buffer operation together with the rendering function, as in Fig. 18. Figure 17 shows the 3D graphics frame pipeline. In stage 2, Frame 1 uses the rendering function to render the image while Frame 2 uses clear buffer to clear the frame buffer at the same time, so Frame 2 interferes with the result of Frame 1. To correct this, we modify the process flow as in Fig. 18, moving the clear-buffer operation after the geometry function.
Figure 19 shows the benchmark (an elephant), and Table 3 shows the results for three different bus architectures. The Sequence column gives the execution of five frames one after another. The Pipelining column gives the execution with the frame pipeline of Fig. 18. Multi-layer AHB clearly improves performance more than two-layer and single-layer AHB under pipelining. On the other hand, multi-layer and two-layer AHB cost more area than single-layer AHB, so there is still a trade-off between cost and performance.
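The benefit of the two-TLB frame pipeline can be estimated with an idealized two-stage model that uses the geometry and rendering cycle counts quoted from Fig. 10. The sketch ignores bus contention, clear-buffer time, and software overhead, so its output is an upper-bound estimate and not the measured data of Table 3; the function names are illustrative.

```python
def sequential_cycles(frames, geometry, rendering):
    """Each frame runs geometry and then rendering, one frame after another."""
    return frames * (geometry + rendering)

def pipelined_cycles(frames, geometry, rendering):
    """Geometry of frame k+1 overlaps rendering of frame k (two TLBs)."""
    if frames == 0:
        return 0
    return geometry + (frames - 1) * max(geometry, rendering) + rendering

GEOMETRY = 4_674_031    # geometry function cycles, from Fig. 10
RENDERING = 2_922_827   # rendering function cycles, from Fig. 10

seq = sequential_cycles(5, GEOMETRY, RENDERING)
pipe = pipelined_cycles(5, GEOMETRY, RENDERING)
print(seq, pipe, seq / pipe)
```

Because the geometry function is the longer stage, it sets the steady-state frame time in this model. Any rendering-side work that fits into the remaining slack, such as the clear-buffer operation, adds nothing to the total.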
Conclusion
System-level design is well suited to simulating complex systems, and system-level exploration generally leads to a more effective architecture. We have proposed a system-level design space exploration model for refining a tile-based 3D graphics SoC. The approach lets us refine performance step by step. In this paper, we have shown that it improves system performance by 198% for the geometry function and by 69% for the rendering function. The approach makes it possible to analyze and profile the impact of both hardware and software on the system at an early stage, and to meet the required constraints efficiently and systematically.
The authors would like to thank Himax Technologies Inc. for their generous financial and technical support.
