Abstract: The traditional post-TnL vertex cache (abbr. 'post-VC') used in embedded GPUs (EGPUs) with a single vertex or unified shader does not suit multi-shader EGPUs, for two reasons. Because multiple shaders run in parallel, (a) out-of-order vertex processing may make the post-VC inconsistent, so that erroneous data are cached, and (b) it is very hard to detect in time, at the vertex fetching stage, which vertices are stored in the post-VC, which results in low performance. In this paper, we propose a modified post-VC that includes a decoupling cache and a vertex batch in-order commit controller, which guarantee that the data SRAM and the index tag are updated in order, according to the same replacement policy, in the different stages of vertex processing. The function of the proposed post-VC is verified on an FPGA-based platform. Experimental results show that it increases performance by an average of 172% and 80.6% compared to the EGPU without and with the traditional post-VC, respectively, at little expense.
Introduction
Nowadays, various embedded 3-D graphics processing units (EGPUs), such as PowerVR [1] and Tegra [2], are widely used in mobile devices, mainly for 3-D games, navigation and other graphics applications. To provide a better visual experience, more complicated geometry models are being introduced. The resulting high data intensity requires EGPUs to perform a huge number of memory accesses to fetch vertices and pixels and to read/write color or depth values from/to external memory. However, current memory systems in mobile devices cannot provide the access speed and bandwidth that EGPUs require. To cope with these problems, [3, 4] first proposed a vertex cache with two parts, placed before and after the vertex/unified shader, as shown in Fig. 1. One is the pre-TnL (Transformation and Lighting) vertex cache (abbr. 'pre-VC'), which saves bandwidth by reducing the number of vertex accesses to external memory. Generally, 3-D objects are organized as triangle strips to save storage [5]. In this way, only n + 2 vertices are needed to represent n triangle primitives in a strip, but the same vertex can be referenced multiple times by different triangles. Therefore, a post-TnL vertex cache (abbr. 'post-VC') is also proposed as part of the vertex output buffer (VOB). It improves EGPU performance by storing processed vertices to avoid redundant operations. In the vertex fetching stage, a new index is checked against both caches at the same time. If the pre-VC hits, the vertex is fetched from the pre-VC, which avoids an external memory access. If the post-VC hits, the vertex is neither read from external memory nor processed by the shader. A new vertex is fetched from external memory only when both caches miss. Hence, the speed and bandwidth burden is relieved effectively.
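As a small illustration of the strip property described above, the following Python sketch expands a strip of n + 2 indices into n triangles and counts how often each vertex is referenced. The function names are ours and purely illustrative, and the alternation of winding order between consecutive triangles is ignored because it does not affect the counts.

```python
def strip_to_triangles(strip):
    """Expand a triangle strip of vertex indices into index triplets.

    A strip of n + 2 indices yields n triangles; triangle i uses the
    indices at positions i, i + 1 and i + 2 (winding order is ignored).
    """
    return [tuple(strip[i:i + 3]) for i in range(len(strip) - 2)]


def count_references(triangles):
    """Count how many triangles reference each vertex index."""
    counts = {}
    for tri in triangles:
        for v in tri:
            counts[v] = counts.get(v, 0) + 1
    return counts


strip = [0, 1, 2, 3, 4, 5]             # n + 2 = 6 vertices
triangles = strip_to_triangles(strip)  # n = 4 triangles
refs = count_references(triangles)

print(triangles)
print(refs)
```

Interior vertices of the strip are referenced by up to three triangles, so without a post-VC the same vertex would be shaded repeatedly.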
The traditional vertex cache for single shader EGPUs

Existing issues in the vertex cache for multi-shader EGPUs
The traditional vertex cache is designed for EGPUs with only one shader. Nowadays, EGPUs tend to be equipped with multiple shaders. In multi-shader EGPUs, the pre-VC architecture can still be used. However, vertex fetching, execution and commit proceed in parallel and out of order among the multiple shaders, which brings new challenges to the post-VC design.
(1). Function Issue: The data SRAM of the traditional post-VC is updated immediately after the vertices are output from the shader. However, in multi-shader EGPUs, more than one shader may request to update the data SRAM concurrently. In this case, a vertex that has hit in the post-VC may be replaced by other processed vertices before the primitive assembly (PA), which leads to cache inconsistency and erroneous stored data.
(2). Performance Issue: Because multiple shaders fetch and process vertices in parallel, post-VC hit detection becomes more difficult. When the vertex fetcher (VF) is ready to pick up a vertex, this vertex may still be running in some shader and not yet be stored in the post-VC. It will then be executed again, and such redundant operations decrease EGPU performance. In other words, for multi-shader EGPUs it is very hard to detect accurately and in time, at the vertex fetching stage, which vertices will be buffered in the post-VC.
To keep the post-VC effective in multi-shader EGPUs, in this paper we modify the post-VC architecture as shown in Fig. 2 (the bold blocks are the modified parts). First, the data SRAM and the index tag of the post-VC are decoupled and integrated into the PA and the VF, respectively. Then, a vertex batch in-order commit controller (VIOC) is introduced into the EGPU task scheduler to guarantee that the data SRAM and the index tag are updated in the same order and by the same replacement policy in the different vertex processing stages. Experimental results show that the post-VC functions correctly and that EGPU performance is significantly improved at little hardware cost. Thus, it is a good alternative post-VC for future multi-shader EGPUs.
2 The modified post-TnL vertex cache for multi-shader EGPUs
2.1 Vertex batch
Generally, in current multi-shader EGPUs, the vertex batch is the basic unit processed by each shader [6]. As depicted in Fig. 3, it consists of two data sets. One is the vertex set, which stores the vertex attributes to be shaded or the corresponding results. The number of entries in the vertex set, NV, is equal to the number of threads of the shader (i.e., NT). In this way, each thread is assigned one vertex so that the shader's computation resources are fully utilized. The other is the primitive set, which is organized as triplet entries. Each entry contains three vertex indices that the PA uses to assemble a triangle primitive. Every index is the entry address of a vertex in the vertex set or in the data SRAM of the post-VC. In theory, for a given area cost, the number of entries in the primitive set, NP, can be any value greater than NV − 2. However, in Section 3.2 we show that the post-VC achieves a better tradeoff with a specific primitive set size.
The VF reads the vertices in the order of the strip index. Then, as long as neither the vertex set nor the primitive set is full, the vertices that miss in the post-VC are inserted into the vertex set, and the indices of all fetched vertices are stored in the primitive set. When either set is full, a vertex batch is built and sent to the unshaded vertex batch buffer. The EGPU task scheduler selects one vertex batch from the buffer and issues it to an idle shader. Note that a primitive set can contain triangles that together reference more than NV vertices, because vertex indices can point not only to vertices in the vertex set but also to vertices in the post-VC. Only the vertices in the vertex set are processed by the shaders. All other vertices hit in the post-VC and do not need to be processed again, which improves performance.
The decoupling post-TnL vertex cache
2.2.1 The data SRAM architecture
To address the function issue, we decouple the data SRAM from the VOBs of the shaders and integrate it as the buffer of the PA. Every shader still has its own VOB to store the processed vertices and commits them to the PA in order under the control of the modified EGPU task scheduler (note that we focus only on the data SRAM in this section; the details of the task scheduler are given in Section 2.3). The data SRAM is then updated after the PA has completed. Because the in-order update occurs after the PA, the vertices in the post-VC cannot be replaced wrongly and cause inconsistency.
The data SRAM architecture is depicted in Fig. 4. Because the vertex batch is the basic processing unit and contains NV entries in the vertex set, the number of entries in the data SRAM is also a multiple of NV. In Fig. 4, the data SRAM is composed of N slots, each of which has E (E = NV) entries. Thus, the data SRAM can store the N × NV most recently processed vertices. When a vertex batch is committed, the PA reads the processed vertices from the vertex set or the data SRAM, according to the indices in the primitive set, to assemble triangle primitives. After that, one whole slot in the data SRAM is replaced by the vertices in the vertex set with the help of the EGPU task scheduler. Because most 3-D graphics applications are stream style, the FIFO replacement policy is adopted.
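The slot-based FIFO replacement described above can be sketched as follows. This is a minimal behavioral model under our own assumptions, not the RTL of the design: the class and method names are hypothetical, vertices are represented by arbitrary Python objects, and a vertex set with fewer than NV vertices is padded with empty entries.

```python
class DataSRAM:
    """Behavioral sketch of a data SRAM with N slots of NV entries each."""

    def __init__(self, num_slots, nv):
        self.nv = nv
        self.slots = [[None] * nv for _ in range(num_slots)]
        self.next_slot = 0  # FIFO pointer: the slot replaced at the next commit

    def read(self, addr):
        """Read one entry by its flat address in the range 0 .. N*NV - 1."""
        return self.slots[addr // self.nv][addr % self.nv]

    def commit(self, vertex_set):
        """Replace one whole slot with a committed vertex set (FIFO order).

        In the scheme above this happens only after the PA has assembled
        the primitives of the batch, so earlier contents stay readable
        until then. Returns the number of the slot that was replaced.
        """
        if len(vertex_set) > self.nv:
            raise ValueError("vertex set larger than one slot")
        slot = self.next_slot
        padding = [None] * (self.nv - len(vertex_set))
        self.slots[slot] = list(vertex_set) + padding
        self.next_slot = (slot + 1) % len(self.slots)
        return slot


sram = DataSRAM(num_slots=2, nv=2)
first = sram.commit(["a", "b"])
second = sram.commit(["c", "d"])
third = sram.commit(["e"])  # wraps around and replaces slot 0
```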
The index tag architecture
The fundamental reason for the performance issue is that, in the traditional post-VC, the index tag and the data SRAM are centralized and updated together after the vertices have been processed. The index tag is therefore updated too late to detect accurately whether a vertex hits before it is input to a shader, because another shader may be processing it at the same time. We therefore move the index tag forward to the VF, where it records the indices of the vertices that have been loaded recently, so that the vertices that will be cached can be detected accurately and in time. The index tag architecture is shown in Fig. 5.
The index tag also consists of N slots with N × NV entries in total, the same as the data SRAM. The VF keeps the recent fetch history (i.e. N × NV vertex indices), which reflects the contents that the post-VC will hold when the current vertex batch is committed. First, when the VF sends a load request in strip index order, the index is passed to the index tag to detect whether the cache hits. A hit means that this vertex is already in a previously built vertex batch and does not need to be processed; only the address of the hit cache entry is added to the primitive set. Otherwise, we check whether this vertex is already included in the vertex set of the current batch. If so, the address of the vertex set entry is added to the primitive set. If both miss, the index is sent to the pre-VC to load the vertex from it or from external memory into the current vertex batch. When the current vertex batch is built, one slot of the index tag is replaced at once by the strip indices of the vertices in the vertex set, rather than after the PA. The replacement policy is also FIFO. In this way, the vertices that will hit in the post-VC can be detected in time to avoid unnecessary operations, and the index tag and the data SRAM are updated consistently by the same replacement policy.
Moreover, to complete the PA, the local vertex indices in the primitive set are used to read vertices. If the post-VC hits, the index in the primitive set is the entry address in the data SRAM. If not, the index is the entry address in the vertex set. Thus, an index in the primitive set has (log2⌈N × NV⌉ + 1) bits: the lower log2⌈N × NV⌉ bits are used to access the N × NV entries in the data SRAM, or the lower log2⌈NV⌉ bits are used to read the vertex set, and the highest bit distinguishes these two address spaces. The vertex index in the primitive set is therefore local, unlike the 16-bit global vertex index of the strip that is used in the index tag and the vertex set.
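The lookup and batch-building flow described above can be modeled in a few lines of Python. This is a simplified sketch under several assumptions of ours, not the authors' implementation: the names `IndexTag` and `build_batches` are hypothetical, the strip is processed one triangle at a time, a batch is closed when the next triangle would overflow either set, the tag lookup is a linear search rather than a parallel compare, and the pre-VC is omitted.

```python
class IndexTag:
    """N slots of NV global vertex indices, replaced in FIFO order."""

    def __init__(self, num_slots, nv):
        self.nv = nv
        self.slots = [[None] * nv for _ in range(num_slots)]
        self.next_slot = 0

    def lookup(self, gidx):
        """Return the flat entry address of a global index, or None on a miss."""
        for s, slot in enumerate(self.slots):
            for e, tag in enumerate(slot):
                if tag == gidx:
                    return s * self.nv + e
        return None

    def replace(self, vertex_set):
        """Overwrite one whole slot when a vertex batch has been built."""
        padding = [None] * (self.nv - len(vertex_set))
        self.slots[self.next_slot] = list(vertex_set) + padding
        self.next_slot = (self.next_slot + 1) % len(self.slots)


def build_batches(strip, tag, nv, np_):
    """Split a strip into (vertex_set, primitive_set) batches."""
    assert nv >= 3, "a triangle may bring up to three new vertices"
    batches, vertex_set, prim_set = [], [], []

    def close():
        nonlocal vertex_set, prim_set
        if prim_set:
            tag.replace(vertex_set)  # tag is updated at build time, not after PA
            batches.append((vertex_set, prim_set))
        vertex_set, prim_set = [], []

    for i in range(len(strip) - 2):
        tri = strip[i:i + 3]
        new = {g for g in tri if tag.lookup(g) is None and g not in vertex_set}
        if len(prim_set) == np_ or len(vertex_set) + len(new) > nv:
            close()
        entry = []
        for g in tri:
            addr = tag.lookup(g)
            if addr is not None:            # hit in the post-VC index tag
                entry.append(("cache", addr))
            else:                           # hit in, or added to, the vertex set
                if g not in vertex_set:
                    vertex_set.append(g)
                entry.append(("set", vertex_set.index(g)))
        prim_set.append(tuple(entry))
    close()
    return batches


tag = IndexTag(num_slots=4, nv=4)
batches = build_batches(list(range(8)), tag, nv=4, np_=8)
shaded = sum(len(vs) for vs, _ in batches)
```

For this 8-vertex strip the sketch shades each vertex exactly once, whereas shading the three vertices of each of the 6 triangles separately would take 18 vertex executions.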
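To make the local index format concrete, the sketch below packs and unpacks such an index. The helper names are hypothetical, and the choice that a top bit of 1 selects the data SRAM (and 0 the vertex set) is our assumption; the text only states that the highest bit distinguishes the two address spaces.

```python
def addr_bits(n_slots, nv):
    """Number of bits needed to address N * NV data SRAM entries."""
    return (n_slots * nv - 1).bit_length()


def local_index_bits(n_slots, nv):
    """Total width of a local index: address bits plus one select bit."""
    return addr_bits(n_slots, nv) + 1


def encode(in_cache, addr, n_slots, nv):
    """Pack a local index. Assumption: select bit 1 means data SRAM."""
    limit = n_slots * nv if in_cache else nv
    if not 0 <= addr < limit:
        raise ValueError("address out of range for the selected space")
    return (int(in_cache) << addr_bits(n_slots, nv)) | addr


def decode(local, n_slots, nv):
    """Unpack a local index into (in_cache, addr)."""
    width = addr_bits(n_slots, nv)
    return bool(local >> width), local & ((1 << width) - 1)


# N = 4 slots and NV = 4 entries per slot give a 5-bit local index.
width = local_index_bits(4, 4)
cache_ref = encode(True, 13, 4, 4)   # entry 13 of the data SRAM
set_ref = encode(False, 3, 4, 4)     # entry 3 of the vertex set
```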
The vertex batch in-order commit controller architecture
Although the decoupling post-VC can accurately record the recently processed vertices in the different stages of the EGPU, it still cannot guarantee that vertex batches processed out of order are committed to the PA in order. Out-of-order commit may update the two parts of the post-VC inconsistently and make the PA load erroneous data. Therefore, a vertex batch in-order commit controller (VIOC) is added to the EGPU task scheduler. The task scheduler is responsible for issuing and committing the vertex batches. It is connected to all the shaders and holds the status of all the vertex batches, which makes it possible to commit them in order.
A 16-bit ID is assigned to every vertex batch to record its order number. Each shader has a register called batch_id_r, which is updated with the ID of a vertex batch when that batch is issued to the shader. The value of batch_id_r is output to the task scheduler to assist the in-order commit. The VIOC architecture is illustrated in Fig. 6. A Shader Status Table (SST) records the status of all the shaders. The number of entries is the same as the number of shaders. Each entry has 4 items: a 1-bit ready flag (R), a 1-bit shader type flag (T), a 1-bit running status flag (S) and a 16-bit vertex batch ID. The R flag indicates whether this shader is idle. The T flag indicates whether this shader is processing vertices or fragments. The S bit indicates whether the results of this shader are ready. In each execution cycle, every shader updates its corresponding entry in the SST. Furthermore, a register called next_batch_r tracks the order number of the next vertex batch to be committed. Its initial value is "0", and it is incremented by "1" after each vertex batch is committed. A vertex batch can be committed to the PA when the following conditions are all met simultaneously:
(1). The shader processing this vertex batch is not idle, i.e. the R flag is "1", and it is executing the vertex program (T = 1).
(2). This shader has completed the computation and is waiting to commit the results, i.e. the S bit is "1".
(3). The vertex batch ID of this shader is equal to the value of the register next_batch_r.
In this way, the shader that can commit its processed vertex batch is selected by simple logic gates. The task scheduler triggers the corresponding shader to transfer its vertex batch to the PA, and the data SRAM is then replaced according to the FIFO policy. The VIOC thus ensures that the index tag and the data SRAM are updated consistently. Therefore, the functional correctness of the post-VC for multi-shader EGPUs is guaranteed.
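The three commit conditions can be expressed as a small behavioral model. The sketch below is only an illustration: the names `SSTEntry`, `select_commit` and `commit` are ours, the hardware evaluates all SST entries in parallel rather than in a loop, and the wrap-around of next_batch_r at 16 bits and the reset of the R and S flags after a commit are our assumptions.

```python
from dataclasses import dataclass


@dataclass
class SSTEntry:
    r: int         # 1 = shader is not idle, as in condition (1)
    t: int         # 1 = shader is running the vertex program
    s: int         # 1 = results are ready and waiting to be committed
    batch_id: int  # 16-bit ID of the vertex batch held by this shader


def select_commit(sst, next_batch):
    """Return the shader that may commit now, or None if none qualifies."""
    for shader, e in enumerate(sst):
        if e.r == 1 and e.t == 1 and e.s == 1 and e.batch_id == next_batch:
            return shader
    return None


def commit(sst, next_batch):
    """Commit at most one batch; return (shader or None, new next_batch)."""
    shader = select_commit(sst, next_batch)
    if shader is None:
        return None, next_batch
    sst[shader].r = 0  # assumption: the shader becomes idle after commit
    sst[shader].s = 0
    return shader, (next_batch + 1) & 0xFFFF  # assumption: 16-bit wrap-around


# Shader 0 finished batch 2 early, shader 1 finished batch 0,
# and shader 2 is still computing batch 1.
sst = [SSTEntry(1, 1, 1, 2), SSTEntry(1, 1, 1, 0), SSTEntry(1, 1, 0, 1)]
order = []
shader, nxt = commit(sst, 0)
order.append(shader)   # batch 0 commits from shader 1
shader, nxt = commit(sst, nxt)
order.append(shader)   # batch 1 is not ready, so nothing commits
sst[2].s = 1           # shader 2 finishes batch 1
shader, nxt = commit(sst, nxt)
order.append(shader)
shader, nxt = commit(sst, nxt)
order.append(shader)   # only now may shader 0 commit batch 2
```

Although shader 0 finishes first, its batch is held back until batches 0 and 1 have been committed.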
3 Experimental results and evaluation
3.1 FPGA prototype and evaluation methodology
To test the function of the proposed post-VC, an FPGA-based 3-D graphics SoC prototype is built by integrating 4 embedded 4-threaded SIMD vertex shaders (VS) [7], one PA and one rasterization engine (RE). The configuration parameters are listed in Table I. As shown in Fig. 7, it is built on a multi-FPGA experimental platform (i.e. BEE3 [8]) with 4 Xilinx Virtex-5 FPGAs that are connected to each other by a high-speed ring interconnection network. Each FPGA is equipped with a 4G DDR2 memory, which is enough to store the 3-D graphic models. In FPGA a, a MicroBlaze acting as the master processor of the SoC executes the 3-D graphics applications and launches the VSs to render. Then, the VF with the index tag (also in the complete triangle assembly. The RE in FPGA d receives the triangles and generates the fragments, which are finally output to the framebuffer and displayed on the screen. The whole FPGA prototype runs at 67.13 MHz. A teapot benchmark with a resolution of 320 × 240 is rendered successfully, which verifies that the proposed post-VC functions correctly.
To evaluate the post-VC performance, 6 widely used 3-D graphics benchmarks listed in Table II [4, 10] are employed. They are all optimized by NVTriStrip [9] to generate triangle strips. These benchmarks are executed on the above FPGA-based prototype, and hardware counters are added to record performance statistics such as the number of vertex batches, render cycles and cache misses. To save execution time, only three vertex attributes (vertex coordinates, normal vector and 2-D texture coordinates) are used. Because there are only VSs on the prototype, the shader program consists only of coordinate transformation and lighting and has no texture operations.
The vertex batch size
Because the vertex batch is the basic unit processed by EGPUs, its size directly affects the shader load, the vertex cache miss rate and so on. For the 4-threaded shader on the FPGA-based prototype, the vertex set has at most 4 entries, so we only need to discuss the primitive set size. Fig. 8a shows the normalized number of vertex batches produced with various primitive set sizes. As the size increases, the number of vertex batches declines. The reason is that the post-VC stores processed vertices, which do not need to be added to the vertex set; only the corresponding hit addresses are sent to the primitive set. Therefore, if the primitive set has too few entries, a vertex batch may be closed early even though the vertex set is not full, which results in more vertex batches. In Fig. 8a, with 8 entries the number of vertex batches is decreased by 61.7%. However, the downward trend almost disappears when the primitive set size is larger than 8. This is because, with an 8-entry primitive set, most vertex sets are already full, i.e. they hold 4 vertices, as shown in Fig. 8b. We can observe that when the primitive set is too small, most vertex batches do not have 4 vertices; e.g. only about 40% of vertex batches hold 4 vertices with 4 primitive set entries. Insufficient vertices waste computation resources and unbalance the load of the shaders. With 8 entries in the primitive set, the proportion of vertex batches with 4 vertices reaches 96.6%. If the number of entries is increased further, the number of vertex batches no longer decreases, while more hardware cost is incurred. Fig. 9 shows the miss rate of the post-VC for different primitive set sizes. When the size is larger, more processed vertices can be stored and the miss rate decreases further.
The post-VC miss rate for the primitive set with 8 entries is only 29.1%, about 25% less than that with 2 entries. Therefore, the primitive set size is finally set to 8 entries, as listed in Table I.
Fig. 11 shows the performance gain of the 16-entry post-VC for the different architecture candidates, normalized to the architecture with no post-VC. Here, the traditional post-VC refers to the configuration in which the index tag and the data SRAM are both in the PA and are replaced simultaneously after a vertex batch is committed by the VIOC. The results show that the performance of the modified post-VC is about 2.72 times that of the architecture with no post-VC, and 80.6% higher than that of the traditional post-VC. The index tag of the modified post-VC is moved to the VF and updated as soon as a vertex batch is built. In this way, more of the vertices that will be stored in the post-VC after commit can be detected as hits accurately and in time, even while they are still running in the shaders, which avoids many redundant operations and yields a larger performance improvement.
Table III lists the storage cost of the post-VC for the 4-shader EGPU prototype. As discussed in Section 3.2, the primitive set has 8 entries, and each entry stores 3 local vertex indices. Each index has (log2⌈N × NV⌉ + 1) = 5 bits. Furthermore, each entry has one additional 1-bit enable flag to identify whether it is used. The total size of the primitive set is 128 bits. Each entry of the SST in the VIOC corresponds to one shader, so the SST size is 76 bits (i.e. 19-bit × 4). The post-VC has 16 entries to store the vertex attributes. According to Shader Model 3.0, each vertex has at most 16 128-bit attributes. The overall size of the post-VC is 4 KB. The post-VC is implemented with 1-port SRAM, while all the other parts are based on registers.
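The storage figures quoted above can be reproduced with a few lines of arithmetic. The variable names below are ours; N = 4 slots is inferred from the 16-entry post-VC and the 4-entry vertex set, and the remaining values are taken directly from the text.

```python
N_SLOTS = 4       # inferred: 16 post-VC entries / 4 entries per slot
NV = 4            # vertex set entries (4-threaded shader)
NP = 8            # primitive set entries
NUM_SHADERS = 4

# Local vertex index: address bits for N * NV entries plus one select bit.
local_index_bits = (N_SLOTS * NV - 1).bit_length() + 1

# Primitive set: three local indices and one enable flag per entry.
prim_entry_bits = 3 * local_index_bits + 1
prim_set_bits = NP * prim_entry_bits

# SST: R, T and S flags plus a 16-bit batch ID per shader.
sst_entry_bits = 1 + 1 + 1 + 16
sst_bits = NUM_SHADERS * sst_entry_bits

# Data SRAM: 16 entries, each holding up to 16 attributes of 128 bits.
post_vc_bytes = N_SLOTS * NV * 16 * 128 // 8

print(local_index_bits, prim_set_bits, sst_bits, post_vc_bytes)
```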
Even if the number of the shaders increases, because the scale of the 3D models is fixed, the storage area of post-VC does not need to change significantly except for the SST. A Verilog HDL implementation of the post-VC is synthesized in SMIC 0.18 µm CMOS technology. The logic part consists of about 1.8 K logic gates that can be ignored compared with the shader hardware cost (300 K logic gates) [7] . 
The post-TnL vertex cache hardware cost
Conclusion
For multi-shader EGPUs, we have proposed a modified post-TnL vertex cache with two parts. One is a decoupling post-VC, which places the data SRAM in the PA as its data buffer and the index tag in the VF. The other is a vertex batch in-order commit controller in the task scheduler. The former can accurately detect, at the vertex fetching stage, whether a vertex is in the post-VC. The latter guarantees that both parts are updated in order according to the same replacement policy. As a result, the performance and function issues are solved when the post-VC is applied to multi-shader EGPUs. Experimental results show that the modified post-VC with 16 entries improves performance by about 80.6% compared to the traditional post-VC. Its function is also verified on an FPGA-based 3-D graphics prototype with 4 vertex shaders.
