This paper introduces a hardware engine for rendering two-dimensional vector graphics based on the OpenVG standard in portable devices. We focus on two design challenges posed by the rendering engines: the number of vertices to represent the images and the amount of memory usage. Redundant vertices are eliminated using adaptive tessellation, in which the redundancy can be judged using a proposed cost-per-quality measure. A simplified edge-flag rendering algorithm and the scanline-based rendering scheme are adopted to reduce external memory access. The designed rendering engine occupies approximately 173K gates and can satisfy real-time requirements of many applications when it is implemented using a 0.18 μm, 1.8 V CMOS standard cell library. An FPGA prototype using a system-on-a-chip platform has been developed and tested.
Introduction
The OpenVG standard was established by the Khronos Group to provide an application programming interface for hardware-accelerated two-dimensional (2D) graphics [5]. Since the OpenVG Specification 1.0 was released in 2005 [5], several studies have reported on the implementation of the standard. Lee et al. [8] have presented the first commercial OpenVG implementation in software. However, 2D vector graphics are mainly intended for portable devices, which generally require hardware acceleration when rendering 2D vector graphics. This is because software-only solutions frequently fail to meet the real-time requirements of the devices. Thus, several works, including [9], proposed the use of powerful multimedia processors or three-dimensional graphics engines in rendering OpenVG graphics. However, numerous low-end portable devices cannot afford these expensive solutions.
The real-time requirements of OpenVG applications can also be fulfilled by adding a dedicated hardware engine. An OpenVG hardware engine with a dual-scanline rendering approach and an active-edge management scheme has been introduced in an earlier work [6]. Seo et al. [12] have shown that a more efficient active-edge management approach can improve rendering speed. There have also been studies, including [7] and [13], that have reported on the implementation methods of various components of the hardware engine. However, none of the published studies on the OpenVG engine has described the hardware implementation of the geometry component, which distinguishes OpenVG from other standards.
This paper introduces a hardware OpenVG rendering engine equipped with a geometry processor (GP) which processes the geometry component in the hardware. The GP produces the vertices in the drawing surface for rendering the objects specified by the OpenVG instructions.
The number of vertices affects the rendering cost as well as the output quality. This paper proposes three parameters to determine the number of vertices and a metric for optimizing the selection of the parameters. The selection considers both the cost of storing the vertices and the quality of the image.
The rest of this paper is organized as follows: Sect. 1.1 discusses the challenges posed by the rendering hardware for 2D vector graphics; Sect. 2 presents a brief overview of the implemented OpenVG rendering engine; and Sect. 3 describes the GP engine and the algorithms adopted in the engine. Section 4 provides an overview of the three remaining functional modules in the proposed rendering engine. Section 5 presents the experiment results, and Sect. 6 puts forward the conclusions.
Challenges in 2D Vector Graphics Hardware
In developing a hardware rendering engine for 2D vector graphics, two challenges should be added to the common list of design challenges for hardware accelerators. These are the large amount of vertex information and the large amount of memory usage. These two challenges should be considered especially for small portable devices.
Amount of vertex information: In the OpenVG application programming interface (API), all geometries must be defined in terms of one or more paths, each of which is defined by a sequence of segment commands [5]. To process these segment commands, we need to draw lines, Bézier curves, and elliptical arcs. In practice, these curves and arcs are approximated using a number of short line segments, each of which is specified by two end vertices. A vertex can be defined by the coordinates and the tangent. A high-quality vector image requires a large number of vertices to be processed, so the amount of vertex (or edge) information is often huge for small portable devices. Generally, this is the main cause of time and energy consumption of the hardware. Therefore, it is necessary for a 2D vector graphics rendering engine to adopt an adaptive tessellation scheme so that it handles as few vertices as possible with the least loss of image quality [1], [10], [12].
Copyright © 2011 The Institute of Electronics, Information and Communication Engineers
Amount of memory usage: Even after the adaptive tessellation scheme is applied, the amount of image data remains huge for most 2D vector graphics. A practical vector image can be composed of hundreds of paths. When a rendering engine renders an image in a path-by-path manner, as in [4], [6], and [12], the engine renders all the edges of one path, and then begins to render the next path. Numerous overlapped paths are expected in complex vector images because paths can be arbitrarily defined by the user. Hence, one scanline can be repeatedly processed for multiple paths. Thus, the same memory access can be repeated even for the invisible pixels, which belong to the overlapped paths. Therefore, adopting a mechanism to avoid these repeated memory accesses is necessary. A smarter way of managing the image data is also needed.
Overview of the OpenVG Rendering Engine
Figure 1 shows the architecture of the OpenVG rendering engine introduced in this paper. The engine consists of four modules, namely, GP, Tessellator, Rasterizer, and PixelPipe.
The rendering process can be divided into two parts: vertex processing and edge processing. The introduced OpenVG rendering engine is pipelined in both the vertex and edge processing parts. Figure 2 shows the pipelining of the functional modules. In the vertex processing part, the GP and Tessellator modules are pipelined to generate the vertices and edges of the whole image frame in a path-by-path manner. In the edge processing part, the Rasterizer and PixelPipe modules sequentially render the scanlines in a scanline-by-scanline manner; that is, the modules render all the paths spanning one scanline (in a path-by-path manner) and subsequently begin to render the next scanline. Although the scanlines are sequentially rendered, the active edges for the next scanline are generated while rendering the current scanline.
Geometry Processor (GP)
The GP module fetches OpenVG API instructions from an external memory and produces the vertices in the drawing surface. The OpenVG API instructions are composed of two groups: segments and operands. The segments indicate the geometric objects, i.e., lines, Bézier curves, and elliptical arcs. The operands are represented in a 28-bit truncated floating-point format and indicate the coordinates and control points of the Bézier curves, or the radii and rotation angles of the ellipses.
The GP employs an adaptive tessellation scheme that improves on the scheme proposed in an earlier study [12]. The number of vertices is further reduced by vertex refinement, i.e., by merging vertices that lie very close to each other.
Adaptive Tessellation of Elliptical Arcs
The OpenVG API defines an elliptical arc using the start and end vertices on the ellipse, two radii (R_h and R_v), and the rotation angle (α). The drawing direction (clockwise or counterclockwise) and the arc length (long arc or short arc) are also given. To draw an arc, the renderer approximates the arc with a number of short line segments by inserting more vertices. The GP contains a heuristic algorithm that finds these vertices with simpler and fewer calculations than the algorithm presented in an earlier study [12], which is itself considerably simpler than the algorithm in the OpenVG reference implementation [5].
In the proposed algorithm, the following matrix is used to affine-transform the ellipse into a unit circle, which is shown as the solid circle in Fig. 3 .
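As a sketch only, assuming the ellipse is centered at the origin with radii R_h and R_v and rotated by α, one matrix with the described effect first undoes the rotation and then rescales each axis by the inverse of its radius. The symbol M and the centering at the origin are assumptions made for this illustration.

```latex
\[
M \;=\;
\begin{pmatrix} 1/R_h & 0 \\ 0 & 1/R_v \end{pmatrix}
\begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix},
\qquad
M \begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} \cos t \\ \sin t \end{pmatrix}
\quad\text{for}\quad
\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}
\begin{pmatrix} R_h \cos t \\ R_v \sin t \end{pmatrix}.
\]
```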
Fig. 3 Geometric meaning of ElpDis.
Subsequently, an extended circle (the dotted circle in Fig. 3) with radius R, which occupies the same area as the ellipse, is used to determine the number of internal vertices on the unit circle. A parameter, ElpDis, is defined to control the precision of smoothness. If R ≤ 1, then the ellipse will be painted as one pixel in the drawing surface. Assuming R > 1, the angle θ should be calculated such that every pair of inserted vertices satisfies Eq. (2):
Approximating the term sin(π/2 − θ) by π/2 − θ with an error compensation factor of π/6, the angle θ can be determined by Eq. (3).
It should be noted that the expensive calculations, such as SLERP and arctangent, which have been required in the earlier studies [5], [12], can be avoided. For large extended circles (R → ∞), θ = 0.04 radian (or 2.7°) is determined using Eq. (3). While drawing the longest horizontal arc on the display of XVGA size, a one-pixel mismatch can happen only when the radius R is larger than 25600 pixels (R = 1024/0.04). For a very small circle with a radius of 2 pixels, the angle difference is as big as 60° between two successive vertices that are apart by one horizontal pixel. However, the proposed algorithm inserts a new vertex every 30.9° when ElpDis is defined as one pixel.
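The radius bound and the 60° figure quoted above can be checked with simple arithmetic. The second relation assumes that the angle is measured from the rightmost point of the radius-2 circle to the point one pixel further in horizontally; this reading and the symbol Δφ are assumptions made for this illustration.

```latex
\[
R \;=\; \frac{1024}{0.04} \;=\; 25600 \ \text{pixels},
\qquad
\Delta\varphi \;=\; \arccos\!\left(\frac{2-1}{2}\right)
\;=\; \arccos\frac{1}{2} \;=\; 60^{\circ}.
\]
```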
Finally, the vertices on the unit circle are affine-transformed back into the ellipse. The matrix for transforming from the unit circle to the ellipse guarantees that more vertices will be inserted at the sharp corners of the elliptical arc.
Algorithm 1 formally describes the proposed algorithm for the adaptive tessellation of an elliptical arc.
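The overall flow (step around the unit circle, then map each vertex back to the ellipse) can be sketched in Python as follows. This is a minimal illustration and not Algorithm 1: the function name `tessellate_arc` and its parameters are hypothetical, the arc is given by start and end angles on the unit circle, and the step angle is passed in by the caller rather than derived from ElpDis through Eq. (3).

```python
import math

def tessellate_arc(cx, cy, rh, rv, alpha, start, end, step):
    """Approximate an elliptical arc by vertices (hypothetical helper).

    start/end are angles on the unit circle in radians (end > start);
    step is the maximum angle between two successive vertices and is
    supplied by the caller (it is NOT computed from ElpDis here).
    """
    n = max(1, math.ceil((end - start) / step))   # number of line segments
    cos_a, sin_a = math.cos(alpha), math.sin(alpha)
    vertices = []
    for i in range(n + 1):
        t = start + (end - start) * i / n
        ux, uy = math.cos(t), math.sin(t)         # vertex on the unit circle
        ex, ey = rh * ux, rv * uy                 # scale back to the ellipse radii
        vertices.append((cx + ex * cos_a - ey * sin_a,   # rotate by alpha, translate
                         cy + ex * sin_a + ey * cos_a))
    return vertices

quarter = tessellate_arc(0.0, 0.0, 2.0, 1.0, 0.0, 0.0, math.pi / 2, 0.5)
```

Because the vertices are equally spaced in angle on the unit circle, the scaling step places them closer together near the ends of the longer axis, which is where the ellipse bends most sharply.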
Adaptive Tessellation of Bézier Curves
Whenever a segment of a Bézier curve, defined by the four control points as shown in Fig. 4, is not flat enough to be approximated as a line, the GP divides the curve segment into two. The proposed algorithm converts a quadratic curve into a cubic curve using the degree elevation algorithm [11] and simplifies the algorithm presented in [1]. The proposed algorithm defines two distances, denoted as dis_p1 and dis_p2 as shown in Fig. 4, from the two inner control points to the baseline of the curve segment. The flatness of the curve segment is judged by comparing the sum of these two distances with a user-defined parameter BerDis. If the curve segment is not flat enough, the GP divides the segment into two sub-curves using the de Casteljau algorithm [10]. This procedure is applied recursively until all sub-curves become flat.
Table 1 summarizes the computational cost of adding one vertex onto an elliptical arc or a Bézier curve. The proposed algorithm can avoid the expensive arctangent calculation which has been required in earlier studies [1], [12]. A significant reduction of computational cost can also be observed for other operations.
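The flatness test and recursive subdivision for Bézier curves can be sketched in Python as follows. This is an illustration under stated assumptions rather than the hardware algorithm: the function names are hypothetical, the comparison against BerDis is taken to be "less than or equal", and a recursion limit (`max_depth`) is added as a safety guard that the text does not mention.

```python
import math

def elevate_quadratic(q0, q1, q2):
    """Degree elevation: quadratic control points -> equivalent cubic ones."""
    c1 = ((q0[0] + 2 * q1[0]) / 3, (q0[1] + 2 * q1[1]) / 3)
    c2 = ((q2[0] + 2 * q1[0]) / 3, (q2[1] + 2 * q1[1]) / 3)
    return q0, c1, c2, q2

def distance_to_baseline(p, a, b):
    """Distance from control point p to the baseline through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:                      # degenerate baseline: use point distance
        return math.hypot(p[0] - a[0], p[1] - a[1])
    return abs(dx * (p[1] - a[1]) - dy * (p[0] - a[0])) / length

def midpoint(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

def flatten_cubic(p0, p1, p2, p3, ber_dis, depth=0, max_depth=16):
    """Return the vertices of a polyline approximating the cubic curve."""
    dis_p1 = distance_to_baseline(p1, p0, p3)
    dis_p2 = distance_to_baseline(p2, p0, p3)
    if dis_p1 + dis_p2 <= ber_dis or depth >= max_depth:
        return [p0, p3]                    # flat enough: one line segment
    # de Casteljau split at the curve midpoint (t = 1/2)
    p01, p12, p23 = midpoint(p0, p1), midpoint(p1, p2), midpoint(p2, p3)
    p012, p123 = midpoint(p01, p12), midpoint(p12, p23)
    p0123 = midpoint(p012, p123)
    left = flatten_cubic(p0, p01, p012, p0123, ber_dis, depth + 1, max_depth)
    right = flatten_cubic(p0123, p123, p23, p3, ber_dis, depth + 1, max_depth)
    return left[:-1] + right               # drop the duplicated shared vertex

straight = flatten_cubic((0, 0), (1, 0), (2, 0), (3, 0), 1.0)
curved = flatten_cubic((0, 0), (0, 10), (10, 10), (10, 0), 1.0)
```

The test needs only multiplications, additions, and one division per distance, and no arctangent. A smaller `ber_dis` produces more vertices.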
Vertex Refinement
As pointed out by Kim et al. [6] , most edges in OpenVG [5] . The proposed rendering engine adds one more step, called vertex refinement, after the adaptive tessellation process. In this step, vertices are merged in the fill path to reduce the number of edges. While transforming the image from the user surface to the drawing surface, we merge the two vertices if any two successive vertices (except the start or end vertices) are located in a box, the side length of which is less than a threshold value VerDis. This merging process decreases the number of edges dramatically, especially when the image that should be rendered is small in the drawing surface.
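The merging rule can be sketched in Python as follows. This is one possible reading of the rule, not the hardware implementation: the function name `refine_vertices` is hypothetical, "merging" is modeled by dropping the later of two successive vertices, and the start and end vertices of the path are always kept.

```python
def refine_vertices(vertices, ver_dis):
    """Drop interior vertices that fall in a VerDis-sized box around the
    previously kept vertex (assumed interpretation of vertex refinement)."""
    if len(vertices) <= 2:
        return list(vertices)
    kept = [vertices[0]]                       # the start vertex is never removed
    for x, y in vertices[1:-1]:
        kx, ky = kept[-1]
        if abs(x - kx) < ver_dis and abs(y - ky) < ver_dis:
            continue                           # both fit in one box: merge (drop)
        kept.append((x, y))
    kept.append(vertices[-1])                  # the end vertex is never removed
    return kept

path = [(0, 0), (0.2, 0.1), (0.4, 0.3), (5, 5), (5.1, 5.2), (10, 10)]
refined = refine_vertices(path, 1.0)
```

With a threshold of one pixel, the six vertices above collapse to three, which mirrors the observation that the savings grow when the rendered image is small in the drawing surface.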
Tradeoff of Image Cost and Quality
The peak signal-to-noise ratio (PSNR) is one of the most popular metrics used for representing image-quality loss in a pixel-level evaluation. Wang et al. [15] have introduced the structural similarity index (SSI) to evaluate the image quality at the structural level. Although a better image quality can always be achieved by increasing the number of vertices, an increase in the number of vertices does not linearly benefit the image quality in most cases. Moreover, achieving a very high image quality does not necessarily make the image more pleasing to the human eye. Thus, a PSNR of 30 is considered to represent the acceptable quality of the images in this paper.
To balance image quality with implementation cost, this paper also proposes the use of a cost function which is defined as Eq. (4) for the i-th configuration of parameters.
where cost is the number of vertices and F, the factor of image quality, is defined as Eq. (5).
Note that F is employed to consider image structure errors. It is defined as a function of PSNR and the normalized SSI. Normalized SSI is defined by Eq. (6).
where MaxSSI and MinSSI stand for the maximum and minimum SSI values, respectively. Note that F = 1, when PSNR equals 30 or when SSI is the maximum over all the configurations of parameters. Even when the image quality in PSNR is better than the target (i.e., larger than 30), the cost function will increase if the number of vertices quickly increases. Hence, a smaller CostQ indicates a more appropriate vertices usage for image quality.
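The selection of a parameter configuration by the smallest cost-per-quality value can be sketched in Python as follows. This is a hedged illustration: `normalize_ssi` assumes that the normalization is a plain min-max rescaling over all configurations, the cost function is supplied by the caller, and `placeholder_cost_q` is a stand-in invented for this example, not Eq. (4) or Eq. (5).

```python
def normalize_ssi(ssi_values):
    """Min-max normalization over all configurations (assumed form)."""
    lo, hi = min(ssi_values), max(ssi_values)
    if hi == lo:
        return [1.0 for _ in ssi_values]
    return [(s - lo) / (hi - lo) for s in ssi_values]

def select_configuration(records, cost_q):
    """Return the configuration whose cost-per-quality value is smallest.

    records: dicts with keys 'config', 'vertices', 'psnr', 'ssi'.
    cost_q:  caller-supplied function (vertices, psnr, normalized_ssi) -> float.
    """
    norm = normalize_ssi([r["ssi"] for r in records])
    scored = [(cost_q(r["vertices"], r["psnr"], n), r["config"])
              for r, n in zip(records, norm)]
    return min(scored, key=lambda pair: pair[0])[1]

def placeholder_cost_q(vertices, psnr, normalized_ssi):
    # Stand-in only: penalizes vertex count and rewards PSNR relative to 30.
    return vertices * 30.0 / psnr

records = [
    {"config": 0.5, "vertices": 1000, "psnr": 40.0, "ssi": 0.99},
    {"config": 1.0, "vertices": 400, "psnr": 32.0, "ssi": 0.97},
    {"config": 2.0, "vertices": 300, "psnr": 20.0, "ssi": 0.90},
]
best = select_configuration(records, placeholder_cost_q)
```

The numbers in `records` are made up for the example and are not measurements from the paper.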
Other Modules

Tessellator
In Fig. 1 , the Tessellator engine reads the vertices produced by GP, from the vertex FIFO and generates the corresponding edges for the Rasterizer. If the stroke operation is invoked by users, the tangent value and the stroke width are used to generate the stroke boundary edges; otherwise, the vertices are connected sequentially to generate the edges for the Fill operations. The linked-list edge management module (LEM) manages the edge list for rendering graphics in a scanline-by-scanline manner [13] . As explained in Sect. 4.3, the arranged edges will be stored in an external memory while all the paths are being processed.
Rasterizer
In Fig. 1, the Rasterizer engine reads the edges from an external memory and renders each scanline in a path-by-path manner. The edges spanning the next scanline are generated by the linked-list active-edge generation module (LAEM) while rendering the current scanline. Two FIFOs, the Info FIFO and Flag FIFO, are used to store the path configuration and the flag bits. The double buffering scheme is adopted between the Rasterizer and PixelPipe for parallelization. For efficiency in rendering in terms of internal memory usage, an edge-flag algorithm similar to the one proposed by Shen et al. [14] is adopted in this module. The edge-flag algorithm in [14] uses one counter per pixel, whereas the usual edge-flag algorithms [4] use one counter per sample point. Thus, the algorithm in [14] consumes less memory and performs fewer computations, and it operates well unless multiple edges overlap with each other in one path, which is very unusual in practical applications.
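The basic idea behind edge-flag filling can be sketched in Python for a single scanline. This is the classic flag-toggling scheme with one flag bit per pixel and the even-odd fill rule. It is a simplification for illustration and not the counter-based algorithm of [14]; the function name and the edge representation `(x0, y0, x1, y1)` are hypothetical.

```python
import math

def edge_flag_scanline(edges, y, width):
    """Return one inside/outside boolean per pixel of scanline y (even-odd)."""
    flags = [False] * width
    yc = y + 0.5                              # sample at the scanline's vertical center
    for x0, y0, x1, y1 in edges:
        if y0 == y1:
            continue                          # horizontal edges set no flags
        if min(y0, y1) <= yc < max(y0, y1):
            x = x0 + (yc - y0) * (x1 - x0) / (y1 - y0)
            px = max(0, math.ceil(x - 0.5))   # first pixel whose center is at or right of x
            if px < width:
                flags[px] = not flags[px]     # toggle the edge flag
    inside = False
    coverage = []
    for flag in flags:                        # sweep left to right, flipping at each flag
        if flag:
            inside = not inside
        coverage.append(inside)
    return coverage

square = [(2, 0, 6, 0), (6, 0, 6, 4), (6, 4, 2, 4), (2, 4, 2, 0)]
row = edge_flag_scanline(square, 1, 8)
```

Only one pass over the edges and one pass over the scanline are needed, and the working storage is a single row of flags.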
PixelPipe
The PixelPipe engine reads information from the Rasterizer and determines the final color of the pixel. The PixelPipe fully supports OpenVG functions such as blending, gradient, image, and anti-aliasing.
The OpenVG API describes vector graphics in a path-by-path manner [5]. Generally, each path represents a polygon. The pixels inside the polygon are anti-aliased and subsequently painted. If the polygons overlap with each other, the color of the overlapped pixels should be determined by the blending operation, which requires the previous color determined by the previous polygons. Thus, it is usual for a common OpenVG application to paint a large number of pixels repeatedly. Table 2 shows the ratio of overlapped pixels for the benchmark images shown in Fig. 5. Four benchmark images are used in this table. These are Tiger [5] (in which stroke operations dominate), Subway [3] (which contains many letters), and Manga [3] and Clock [3] (in which fill operations dominate). A higher ratio of overlapped pixels implies more repetitions of access to the frame buffer.
Since the frame buffer is generally implemented in an external memory, its accesses exact timing and energy burdens on the whole system. To write to the frame buffer only once for each visible pixel of the final image, the rendering engine introduced in this paper renders graphics in a scanline-by-scanline manner. After the edges for all paths are prepared in a linked-list structure, the engine processes each scanline only once for all paths, that is, it processes all paths spanning one scanline at a time. One disadvantage of this method of rendering is that the edges have to be stored in an external memory while processing all the paths, because the amount of data for the edges spanning a scanline is frequently much larger than the capacity of the on-chip memory in portable devices. On the other hand, scanline-by-scanline rendering enables the use of a scanline-sized internal frame buffer, called Tile frame buffer [13] . When using the Tile frame buffer, the accelerator can avoid repeated access to the external frame buffer and thereby reduce external memory access considerably for the images in which numerous objects overlap with each other [13] . The Tile frame buffer, which is not a big burden for the internal memory, is implemented in the PixelPipe engine.
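The effect of a scanline-sized buffer on external writes can be illustrated with a small Python model. The model is a sketch under simplifying assumptions: paths are opaque axis-aligned rectangles `(x0, y0, x1, y1)` painted in order, blending is replaced by overwriting, and only pixels touched on a scanline are counted as written back. All names are hypothetical.

```python
def covers(rect, x, y):
    x0, y0, x1, y1 = rect
    return x0 <= x < x1 and y0 <= y < y1

def render_path_by_path(paths, width, height):
    """Every covered pixel of every path is written to the external frame."""
    frame = [[0] * width for _ in range(height)]
    external_writes = 0
    for color, rect in paths:
        for y in range(height):
            for x in range(width):
                if covers(rect, x, y):
                    frame[y][x] = color
                    external_writes += 1
    return frame, external_writes

def render_scanline_by_scanline(paths, width, height):
    """All paths are composed in a scanline-sized buffer, then copied once."""
    frame = [[0] * width for _ in range(height)]
    external_writes = 0
    for y in range(height):
        tile = [0] * width                 # models the on-chip Tile frame buffer
        touched = [False] * width
        for color, rect in paths:
            for x in range(width):
                if covers(rect, x, y):
                    tile[x] = color        # overwrite stands in for blending
                    touched[x] = True
        for x in range(width):
            if touched[x]:
                frame[y][x] = tile[x]      # one external write per visible pixel
                external_writes += 1
    return frame, external_writes

paths = [(1, (0, 0, 4, 4)), (2, (2, 2, 6, 6))]
frame_a, writes_a = render_path_by_path(paths, 8, 8)
frame_b, writes_b = render_scanline_by_scanline(paths, 8, 8)
```

For the two overlapping squares above, both methods produce the same image, but the scanline-based model performs 28 external writes instead of 32 because the four overlapped pixels are written only once.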
An Example
A linked list of slots is used for a scanline to manage the edges from multiple paths (spanning the scanline) in the LEM module of the Tessellator engine. Initially, a fixed-size memory space is assigned to each slot in the list. Then, the edges are inserted into the slot according to their upper y coordinate. Let us consider, as an example, Fig. 6 with two overlapped paths: a rectangle and a triangle. These two paths consist of the edges denoted as 1 to 4 and 5 to 7, respectively. An end tag, denoted as "C", is inserted to indicate the end of the list. If an additional slot is needed, the last position of the last slot contains the address of the next slot, denoted as "A" in the figure.
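The slot structure described above can be modeled in Python as follows. This is a sketch for illustration only: the class name `EdgeSlots`, the slot size of four positions, and the representation of an edge as a `(path_id, edge_id)` pair with an integer path ID are assumptions, while the end tag "C" and the link marker "A" follow the labels used in Fig. 6.

```python
SLOT_SIZE = 4      # hypothetical number of positions per fixed-size slot
END_TAG = "C"      # marks the end of the list

class EdgeSlots:
    def __init__(self):
        self.memory = []   # each element models one fixed-size slot
        self.heads = {}    # upper y coordinate -> address of the first slot

    def _new_slot(self):
        self.memory.append([END_TAG] + [None] * (SLOT_SIZE - 1))
        return len(self.memory) - 1

    def insert(self, upper_y, path_id, edge_id):
        if upper_y not in self.heads:
            self.heads[upper_y] = self._new_slot()
        addr = self.heads[upper_y]
        while END_TAG not in self.memory[addr]:
            addr = self.memory[addr][-1][1]      # follow the "A" link
        slot = self.memory[addr]
        pos = slot.index(END_TAG)
        if pos == SLOT_SIZE - 1:                 # slot is full: chain a new one
            nxt = self._new_slot()
            slot[pos] = ("A", nxt)               # last position holds the next address
            slot, pos = self.memory[nxt], 0
        slot[pos] = (path_id, edge_id)
        slot[pos + 1] = END_TAG                  # move the end tag behind the new edge

    def edges(self, upper_y):
        out, addr = [], self.heads.get(upper_y)
        while addr is not None:
            nxt = None
            for item in self.memory[addr]:
                if item == END_TAG or item is None:
                    break
                if item[0] == "A":
                    nxt = item[1]
                    break
                out.append(item)
            addr = nxt
        return out

lem = EdgeSlots()
for edge in (1, 2, 3, 4):
    lem.insert(10, 0, edge)        # rectangle: path 0, edges 1 to 4
for edge in (5, 6, 7):
    lem.insert(10, 1, edge)        # triangle: path 1, edges 5 to 7
```

In this toy run all seven edges share one upper y coordinate, so the list grows to three chained slots; storing the path ID with each edge lets a reader of the list tell the rectangle's edges from the triangle's.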
For the linked-list edge management, the difference between the algorithm in [12] and the one proposed in this paper is that the proposed algorithm inserts the path IDs to identify the path to which the edges belong. In managing the linked-list active edges, two internal memory modules are used as the next active edge list (NAEL) for pipelining the processes of two consecutive scanlines.
After the Tessellator finishes preparing all the edges for one image frame in the external memory, the Rasterizer sequentially processes the active edges of each path spanning the current scanline. The Rasterizer begins by processing all the active edges of the first path and setting the flags in the even Flag FIFO. The pieces of information (such as color and filling rule) for the first path are pushed into the even Info FIFO. After the Rasterizer finishes one path, the PixelPipe reads the even Flag FIFO and even Info FIFO, calculates the color for the pixels on the scanline, and writes the resulting color to the Tile frame buffer. At the same time, the Rasterizer processes the active edges of the next path and writes the flag bits and path information to the odd Flag FIFO and the odd Info FIFO, respectively. For the blending operation, the previous resulting color is read from the Tile frame buffer instead of the frame buffer. After all paths are rendered in the current scanline, the contents of the Tile frame buffer are copied to the (external) frame buffer, and then the Rasterizer moves to the next scanline.
Experiment Results and Analysis
Vertex Usage and Image Quality
In evaluating the image quality of the proposed rendering system, four test images, as shown in Fig. 5, are used. Table 3 shows the vertex usage (the number of vertices) and the PSNR values for the test images. The user-defined parameters (ElpDis, BerDis and VerDis) are set to 1 pixel in the table. The PSNR values are calculated for the images obtained by the proposed method using the images obtained by the algorithms in [1] and [12] as references. The proposed adaptive tessellation scheme reduces the number of vertices by approximately 32% to 67% for the test images, whereas the PSNR remains at least 28.41 for all the images considered.
Subway has numerous small and short curves and is not sensitive to the change in the number of vertices. Clock has dominantly large circles that can always achieve high image quality. Hence, only Tiger and Manga are considered in this paper when evaluating cost and quality. A VGA-sized screen is assumed. In the experiments, the three parameters (ElpDis, BerDis and VerDis) are configured to have the same value for each configuration. The configurations are in the range from 0.1 to 8.0 with an interval of 0.1. In Figs. 7 and 8, the Y axis indicates the PSNR value and the vertex usage, respectively. The X axis indicates the configuration of the three parameters. The number of used vertices nonlinearly affects image quality. As expected, however, better images are obtained when more vertices are used.
Figures 9 and 10 depict CostQ as a function of the parameter configuration. The figures show that the minimum CostQ, which refers to the best cost per quality value, can be observed at configurations 1.0 and 4.6 for Tiger and Manga, respectively. Since the CostQ values are extremely large when the configuration parameters are greater than 3.0, Fig. 9 only depicts the range in which the minimum CostQ can be easily observed. This general cost-quality evaluation metric can help users obtain the most efficient configuration.
External Memory Usage
For high-quality vector images, the vertices occupy a large part of the capacity of external memory, compared with the API instructions. Because the GP produces the vertices from the API instructions in hardware, the proposed OpenVG rendering engine occupies significantly less external memory for storing the vector image than the existing rendering engines considered in this paper [1], [6], [12]. Figure 11 shows that the OpenVG API instructions occupy 18.72% to 62.17% of the memory capacity that the vertices occupy.
Rendering Speed and Hardware Cost
The proposed OpenVG rendering engine has been modeled in Verilog, verified using Cadence NC-Verilog, and synthesized using Synopsys Design Compiler with a 0.18 μm, 1.8 V CMOS standard cell library. Table 4 compares the speed of the path-based rendering [12], [13] with that of the proposed scanline-by-scanline rendering. The operation speed of the GP is included for both rendering methods in Table 4. The scanline-by-scanline rendering method benefits test images with large areas of overlapped painting; thus, significant speedups are observed for Tiger, Manga, and Clock. On the other hand, Subway is slowed down because it occupies less area but contains a large number of vertices. However, the final rendering speed can still satisfy the real-time requirements.
The designed rendering engine has been prototyped and verified using an FPGA kit with a Xilinx xc5vsx95t device. The system-on-a-chip platform presented in an earlier study [12] is used in the prototype. A demonstration of the operation of the rendering engine on the FPGA kit is shown in Fig. 12. A motion image, GP-Motion, is shown in Fig. 12(d) to verify the function of the GP.
Fig. 11 Ratio of the memory usage for storing API instructions to that for storing vertices.
Conclusion
This paper presented the design of an OpenVG rendering engine that can efficiently render 2D vector graphics. Heuristic algorithms were proposed to enhance the efficiency of the design with respect to the number of vertices to handle and the amount of memory access. We showed that the proposed heuristic algorithms eliminated redundant vertices, used less internal memory, and reduced external memory access. Moreover, we demonstrated that the proposed cost-per-quality metric could be used to adjust the parameters of the algorithms. Experimen-
