I. INTRODUCTION
Recently, real-time 3-D graphics (3DG) have been widely employed on mobile devices such as cellular phones, car navigation systems and portable media players. At the heart of these devices are 3DG chips that provide the computing power for the mobile applications [1]-[7]. These chips are classified, according to their architectures, into programmable 3DG accelerators and full hardware ones. Programmable 3DG accelerators increase flexibility by allowing different programs to be used on the chip. However, this results in very complex processor architectures [2]-[5], and these chips generally have high power consumption. Full hardware 3DG accelerators generally consume less power than programmable ones, but their power consumption is still high in spite of hardware optimization [1], [6].
We propose a 3DG SoC with a fully pipelined hardware geometric and rendering engine and a clock and power management unit (CPMU) that supports several operation modes for various applications. These operation modes can be reconfigured, using a clock-gating scheme and a clock divider, to reduce power consumption and to meet the various needs of mobile applications. To achieve low power consumption in a normal display mode without the 3DG operation, a liquid crystal display (LCD) bypass mode is designed that consumes only 4.32 mW. The bypass mode allows the 3DG SoC to be inserted between a baseband processor and an LCD module. The low power consumption of this SoC allows it to be added to existing mobile devices for 3DG acceleration with minimal changes.
II. 3DG SOC ARCHITECTURE AND APPLICATIONS
There is a tradeoff between flexibility and power consumption in the design of the SoC. High 3DG flexibility requires a high-performance processor with a VLIW architecture or a SIMD instruction set [2]-[5], but a full hardware engine allows low power consumption because it has function units optimized for 3DG. We chose the fully hardware pipelined 3DG accelerator for low power consumption. To support various applications and utilize the advantages of an SoC [3], we integrated an ARM9 reduced instruction set computer (RISC) core, a dual SDRAM controller, and the CPMU. The CPMU supports several operation modes: "normal mode," "3DG application mode," "3DG acceleration mode," and "bypass mode."
The 3DG SoC consists of three masters and four slaves as shown in Fig. 1. The three masters are the ARM9 core, the direct memory access (DMA) controller, and the 3DG accelerator; the four slaves are a memory controller, an internal memory controller, APB peripherals, and an SDRAM controller. Intellectual properties (IPs) are integrated into a three-layer AMBA bus system which allows simultaneous access to three buses from masters to slaves. The memory controller supports flash, ROM, SRAM, and SDRAM. There is another SDRAM controller for the 3DG accelerator and the display controller. To avoid collisions among the LCD, ARM9, and 3DG and to produce a stable display on the LCD, the output arbiters (shown in Fig. 1) support round-robin and programmable arbitration.
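The round-robin policy mentioned for the output arbiters can be illustrated with a small software model. This is only a sketch: the class name `RoundRobinArbiter` and the requester labels are hypothetical, and the text does not describe the actual arbiter logic beyond stating that round-robin and programmable arbitration are supported.

```python
class RoundRobinArbiter:
    """Toy software model of round-robin arbitration (hypothetical, not the chip's logic)."""

    def __init__(self, requesters):
        self.requesters = list(requesters)
        # Start so that the first search begins at index 0.
        self.last = len(self.requesters) - 1

    def grant(self, requests):
        """Grant the next requester after the last winner that has an active request."""
        n = len(self.requesters)
        for offset in range(1, n + 1):
            idx = (self.last + offset) % n
            if self.requesters[idx] in requests:
                self.last = idx
                return self.requesters[idx]
        return None  # nobody is requesting


arbiter = RoundRobinArbiter(["LCD", "ARM9", "3DG"])
# With all three requesting continuously, grants rotate through the requesters.
grants = [arbiter.grant({"LCD", "ARM9", "3DG"}) for _ in range(4)]
```

Because the search always starts just after the previous winner, no requester can be starved while it keeps its request asserted, which is the property that matters for a stable display refresh.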
Mobile devices have various system applications and features according to performance, power consumption, and customer needs. To satisfy these requirements, the 3DG SoC allows users to select a combination of baseband, application processor, 3DG accelerator, and 3DG SoC at the system level. Accordingly, the proposed 3DG SoC can be configured to the normal operation mode, the 3DG application mode, or the 3DG acceleration mode according to the application, to balance power consumption and performance from the SoC point of view. Fig. 2(a) shows the normal mode for a low-end application, in which all blocks operate at maximum performance. In this application, the ARM9 RISC runs the multimedia application, which is low-end 2-D video/image processing, and the 3DG application program interface (API) at the maximum frequency in the normal mode, and the 3DG accelerator is used for 3DG acceleration.
In a high-end application, the SoC can be configured to the 3DG application mode or the 3DG acceleration mode according to where the 3DG API is located. The high-end application supports high-bandwidth video and image processing and digital multimedia broadcasting (DMB) as well as 3DG acceleration. Fig. 2(b) shows the system architecture operating in the 3DG application mode in the case of a high-end application. The 3DG application mode runs the 3DG hardware engine and the 3DG API on the ARM9 at a minimum frequency to reduce the load on the application processor. This mode reduces power consumption by 36% compared to the normal mode. Alternatively, the 3DG acceleration mode operates only the 3DG hardware engine, moving the 3DG API to an external application processor and using clock gating to reduce the power consumed by the ARM9 and the unused hardware blocks. This mode reduces power consumption by 21% compared to the 3DG application mode. In both applications, the 3DG SoC consumes less than 4.3 mW in the LCD bypass mode, when the entire block goes to the power-down mode.
Even though some mobile systems have a 3DG hardware engine, real 3DG acceleration also requires a 3DG API and a library optimized for the 3DG accelerator. For that purpose, we support OpenGL ES 1.1+Ext, JSR 239, the D3DM API, and a 3DG library. OpenGL ES is an API for using 2-D and 3-D graphics in mobile applications and embedded systems. The API provides a low-level interface between the software/hardware 3DG accelerator and software applications. These standard 3DG APIs allow mobile and embedded applications to support 3DG games and various 3-D graphics functions. Even though OpenGL ES is based on the 3-D graphics pipeline, each function can be executed in the full 3DG hardware engine or in a combination of hardware and software routines. This flexibility lets users easily program 3DG applications using the hardware engine functions of OpenGL ES. OpenGL ES includes an embedded graphics library (EGL), which allows users to program 3DG applications regardless of platform and OS. Users can also use an open 3DG library or a third-party library. Fig. 3 shows the task percentages of a CPU on a 3DG application with 3-D contents when the 3DG SoC is used as the application processor. The OpenGL ES API occupies 17.3% of runtime in the normal mode for low-end applications. In this case, we can reduce power consumption by adjusting the clock frequency through the CPMU during normal operation. Another example is the following: when an application processor without a 3DG accelerator is used, we can easily add a 3DG function using the 3DG SoC, which operates in the 3DG application mode with the OpenGL ES API and the 3DG accelerator for high-end applications. The OpenGL ES API can be moved to the 3DG SoC, so we can reduce power consumption with the bypass mode of the 3DG SoC whenever the application processor performs a normal operation such as phone, camera, or some other normal operation.
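The two percentage reductions quoted above are each stated relative to a different baseline, so it can help to chain them. The following sketch assumes the reductions compound exactly as stated (36% relative to the normal mode, then 21% relative to the 3DG application mode) and works in normalized units; it does not use any measured absolute power values.

```python
# Normalized power: the normal mode is taken as 1.0 (an assumption for illustration).
NORMAL = 1.0

# 3DG application mode: 36% lower than the normal mode.
application = NORMAL * (1 - 0.36)

# 3DG acceleration mode: 21% lower than the 3DG application mode.
acceleration = application * (1 - 0.21)

# Combined saving of the acceleration mode relative to the normal mode.
saving_vs_normal = 1 - acceleration

print(f"application mode:  {application:.4f} of normal")
print(f"acceleration mode: {acceleration:.4f} of normal")
print(f"combined saving:   {saving_vs_normal:.1%}")
```

Under these assumptions the acceleration mode draws roughly half of the normal-mode power.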
III. PROPOSED 3-D GRAPHICS ACCELERATOR
The 3DG accelerator has three functional blocks, the geometric engine (GE), the rendering engine (RE), and the display controller, as shown in Fig. 4. The polygon-based operation block performs culling, clipping check and viewport transformation. The GE adopts a five-stage multi-cycled pipeline with an eight-cycle stage latency, which has model-view transform, lighting calculation, projection transform, clip test and perspective division transform stages. The finite-state machine (FSM) block controls the overall operation of the pipeline and an index cache for the GE.
The 3DG pipeline consists of GE and RE processing. The GE receives three vertices from a host processor and assembles a triangle. It then transforms successive triangles while taking the direction of the viewpoint into account, performs the lighting process to compute colors, and performs the clipping process. Finally, the polygon-based process is performed. The maximum geometry transformation performance is 37.5 Mvertices/s if no other operations are performed. We achieved the best performance in a 3DG game by using the full hardware clipping process without a decrease in speed. The most important process for improving performance in the 3DG accelerator is the clipping transaction. Even though the 3DG accelerator possesses a high 3-D transaction speed and pixel fill rate, it is unrealistic to expect high performance with real content without accelerating the clipping process. For that reason, the clipping operation block checks the boundary line and then renders only the portion inside the boundary, even when part of an object lies outside the viewing area, as with a 3-D background passing by in a 3-D racing game. The GE achieves a high transaction speed during gameplay using the fully hardwired clipping check and operation block. Real content likewise requires a high clipping transaction rate.
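The clipping check described above can be illustrated with the conventional clip-space test used in OpenGL-style pipelines, where a vertex is inside the view volume when each of x, y, and z lies between -w and w. This is a generic sketch, not the chip's hardwired logic, which the text does not detail; the function names `outcode` and `classify_triangle` are hypothetical.

```python
def outcode(x, y, z, w):
    """Return a bit mask with one bit per clip plane the vertex lies outside of."""
    code = 0
    if x < -w:
        code |= 1
    if x > w:
        code |= 2
    if y < -w:
        code |= 4
    if y > w:
        code |= 8
    if z < -w:
        code |= 16
    if z > w:
        code |= 32
    return code


def classify_triangle(vertices):
    """Classify a triangle given three clip-space (x, y, z, w) vertices."""
    codes = [outcode(*v) for v in vertices]
    if (codes[0] | codes[1] | codes[2]) == 0:
        return "inside"          # no clipping needed, render as is
    if codes[0] & codes[1] & codes[2]:
        return "rejected"        # all vertices outside the same plane
    return "needs_clipping"      # straddles a boundary, clip before rendering


inside = classify_triangle([(0, 0, 0, 1), (0.5, 0, 0, 1), (0, 0.5, 0, 1)])
rejected = classify_triangle([(2, 0, 0, 1), (3, 0, 0, 1), (2, 1, 0, 1)])
straddling = classify_triangle([(0, 0, 0, 1), (2, 0, 0, 1), (0, 0.5, 0, 1)])
```

Only the third case requires the clipping operation itself; the check alone disposes of triangles that are entirely inside or entirely beyond one boundary.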
In the RE, the engine writes triangles to a frame buffer. The RE consists of the GE/RE interface, triangle setup, edge-walking process, texture map/stereoscopic 3-D process and per-fragment/blending process blocks. The RE has triangle setup, texturing, fog, and per-fragment operation pipeline stages with the blocks working in coordination. Generally, it takes two clocks to write a triangle in the frame buffer because the 3DG calculation has to perform a depth comparison and read/write operation between GE and the frame buffer. This means that the pixel rate is 0.5 pixels/clock. The proposed 3DG accelerator achieves 1.5 pixels/clock using the pipeline structure. The GE/RE interface block controls the GE and RE pipeline stages and data exchange between GE and RE. The per-fragment block and the texture map block have a line buffer and texture buffer whose size is 131 Kbytes to improve data bandwidth. The maximum LCD size is 1024 × 1024 pixels and a single texture is 256 × 128 pixels. We proposed an alternative left-right drawing method (ADM) for the stereoscopic 3-D processing. The conventional method is inefficient because it continuously fills every image in each left and right image line, whether or not this needs to be done. The proposed ADM improves the fill rate by 50% in comparison with the conventional one. The stereoscopic 3-D block supports red/blue, barrier, and glassed types. The 3DG SoC supports OpenGL ES 1.1, JSR-239 and D3DM APIs including seven primitives, eight light sources and a stereoscopic 3-D display. The 3DG performance is 8 Mpolygons/s at 100 MHz. The display controller supports the bypass mode, which passes the video image through the 3DG SoC directly to the LCD; it also supports other modes which send the video image from the 3DG accelerator to the LCD through a first-in-first-out (FIFO) buffer and RGB/CPU/TV interfaces. Fig. 5 shows the block diagram of the CPMU.
The CPMU's role is to manage power consumption depending on the needs of the application. The 3DG SoC has three power domains (CPU, 3DG, and LCD) and six operation modes (full, 3DG application, 3DG engine, SDRAM refresh, LCD, and power-down). The phase-locked loop (PLL) covers frequencies ranging from 25 to 200 MHz in 25 MHz steps with a 25 MHz input clock. The PLL locking time is less than 600 µs and it consumes 1.8 mW. The CPU and 3DG domain controller supports various clock combinations for each block through software according to its purpose and target performance. The normal mode in the 3DG SoC operates as a low-end application processor. In the 3DG application mode, the ARM9 operates with the 3DG API at 25 MHz, and the 3DG accelerator operates at 100 MHz. To reduce power consumption in the 3DG acceleration mode, the ARM9 can be turned off so that only the 3DG hardware is turned on. The power consumption of the power-down mode is under 180 µW.
Fig. 6 shows the experimental and evaluation board for the 3DG SoC. The evaluation board consists of the 3DG SoC, DDR memory, flash memory, LCD display, etc. We used Futuremark's benchmarks to evaluate our 3DG SoC and to compare it with the previous works [1]-[6]. These contents are organized in multi-textures that use mipmaps (mip abbreviates the Latin multum in parvo, "much in a small space") and exercise the 3-D graphics transactions. The first image at the bottom of Fig. 6 is the result of depth testing. The second image is the result of the lighting test. The third image shows the result of texture mapping. Table I compares technology, operation frequency, and performance with the previous works [1]-[6]. The JSSCC06 and JSSCC08 were designed for 0.13-µm CMOS technology. Even though these works have the advantage of a more advanced technology, the proposed SoC achieved better performance and lower power than these works [3], [5].
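The pixel-rate figures quoted for the rendering engine can be converted into absolute fill rates once a clock frequency is fixed. The sketch below assumes the 100 MHz clock that the text gives for the 8 Mpolygons/s figure also applies to the pixel pipeline, which the text does not state explicitly; the frame-rate bound additionally assumes every pixel is written exactly once per frame.

```python
CLOCK_HZ = 100e6  # assumption: same 100 MHz clock as the quoted 3DG performance


def fill_rate(pixels_per_clock, clock_hz=CLOCK_HZ):
    """Pixels written per second for a given pixels/clock rate."""
    return pixels_per_clock * clock_hz


conventional = fill_rate(0.5)   # two clocks per pixel
proposed = fill_rate(1.5)       # pipelined rendering engine

# Upper bound on full-screen redraws per second at the maximum LCD size,
# assuming no overdraw and no other bottleneck (an idealization).
MAX_LCD_PIXELS = 1024 * 1024
max_full_frames_per_s = proposed / MAX_LCD_PIXELS

print(f"conventional: {conventional / 1e6:.0f} Mpixels/s")
print(f"proposed:     {proposed / 1e6:.0f} Mpixels/s")
print(f"speedup:      {proposed / conventional:.1f}x")
print(f"max 1024x1024 redraws/s: {max_full_frames_per_s:.1f}")
```

The threefold ratio follows directly from the two pixels/clock figures and does not depend on the assumed clock.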
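The PLL range and the per-mode clocks described for the CPMU can be captured in a few lines. This is a sketch only: the helper names are hypothetical, and treating each output frequency as an integer multiple of the 25 MHz input clock is an assumption suggested by the 25 MHz step size, not something the text states about the PLL design.

```python
INPUT_CLOCK_MHZ = 25
MIN_MHZ, MAX_MHZ, STEP_MHZ = 25, 200, 25


def pll_settings():
    """All PLL output frequencies in MHz, from 25 to 200 in 25 MHz steps."""
    return list(range(MIN_MHZ, MAX_MHZ + STEP_MHZ, STEP_MHZ))


def multiplier(freq_mhz):
    """Hypothetical integer multiplier relating an output frequency to the input clock."""
    if freq_mhz not in pll_settings():
        raise ValueError(f"{freq_mhz} MHz is not a supported PLL setting")
    return freq_mhz // INPUT_CLOCK_MHZ


# Clocks quoted in the text for the 3DG application mode.
APPLICATION_MODE_MHZ = {"ARM9": 25, "3DG": 100}

for block, freq in APPLICATION_MODE_MHZ.items():
    print(f"{block}: {freq} MHz (x{multiplier(freq)} of input clock)")
```

Both clocks quoted for the 3DG application mode fall on valid PLL steps, so in this model the mode needs no frequency outside the stated range.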
IV. EXPERIMENTAL RESULTS
The power consumption of each operation mode is shown in Fig. 7(a). We use the Samurai 3-D movie to compare with the previous works [1]-[6]. We measure the power consumption and performance with the Samurai movie in the full, 3DG application, and 3DG acceleration modes as shown in Fig. 7(a). In the normal mode, the 3DG accelerator accounts for 47% of the total power consumption, and the ARM946 and peripherals account for 41% and 8%, respectively. In the 3DG application mode, the ARM946 consumes 17% of the total power, so the power consumption is 36% less than in the normal mode when the SoC operates as an application processor. In the 3DG acceleration mode, the power consumption is a further 21% less than the total power of the 3DG application mode. These results allow users to choose the best mode according to the system architecture and its application. The 3DG application and acceleration modes of the SoC can be selected specifically for the application to enable power reduction. Fig. 7(b) shows a comparison of the 3DG performance with the previous works [1]-[6]. The 3DG performance is measured including transformation, clip check/processing, lighting, and texture blending for rendering. The ratio of 3DG drawing speed to power consumption is used as a performance index to compare the tradeoff between power consumption and performance. The 3DG accelerator improves the performance by 59% compared to the best work [5]. The work in [2] achieved 50 Mvertices/s, but only for the geometry performance. Its full 3DG performance is 3.6 Mvertices/s at 155 mW [2]. We achieved 8 Mvertices/s at 108 mW in full 3DG performance.
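The performance index described above, drawing speed divided by power consumption, can be evaluated for the two data points that the text gives in full. The sketch uses only those quoted numbers; the function name is hypothetical. The 59% improvement over [5] is not reproduced here because the corresponding figures for [5] do not appear in the text.

```python
def performance_index(mvertices_per_s, power_mw):
    """Drawing speed per unit power, in Mvertices/s per mW."""
    return mvertices_per_s / power_mw


proposed = performance_index(8.0, 108.0)      # this work, full 3DG performance
reference_2 = performance_index(3.6, 155.0)   # work [2], full 3DG performance

ratio = proposed / reference_2

print(f"proposed:  {proposed:.4f} Mvertices/s per mW")
print(f"work [2]:  {reference_2:.4f} Mvertices/s per mW")
print(f"ratio:     {ratio:.2f}x")
```

On these two data points the proposed SoC delivers a little over three times the drawing speed per milliwatt of the work in [2].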
V. CONCLUSION
A 3DG SoC is designed for mobile applications. It integrates a fully hardwired 3DG accelerator, low power management block, multimedia processor and LCD controller. The SoC is fabricated in a 0.18 µm 1P6M CMOS process. The chip size of the 3DG SoC is 49 mm², which includes the 3DG accelerator, 131 kB texture buffer, and the ARM9 which has 16 K I/D-cache and 8 kB I/D TCM. The dual SDRAM controllers allow the ARM9 and 3DG to access each external SDRAM while avoiding data collisions. The SoC has reconfigurable operation modes to support various mobile applications with minimum power consumption. The SoC achieves 8 Mpolygons/s full 3DG, 24 Mvertices/s geometry performance and 1.5 pixels/clk. Table II summarizes the chip specification and features. Fig. 8 shows the chip micrograph of the 3DG SoC.
I. INTRODUCTION
The Gaussian, Rayleigh, and Ricean distributions have been applied to model and simulate a variety of different scientific and engineering systems. An especially important application of these distributions is to model wireless fading channels. The classical model of a communication channel is the additive white Gaussian noise (AWGN) channel, where the transmitted signal s(t) is corrupted by the addition of white Gaussian noise n(t), thereby producing a received signal y(t) = s(t) + n(t) [1]. In a more accurate model of wireless channels, the received complex envelope is expressed as y(t) = g(t)s(t) + n(t), where the fading gain g(t) is a complex Gaussian random variable with independent quadrature components. If this fading process has a zero (nonzero) mean then the envelope |g(t)| of the gain has the Rayleigh (Ricean) distribution [2].
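The fading model y(t) = g(t)s(t) + n(t) can be sketched in software using only the Python standard library. This is an illustrative floating-point model under stated assumptions, not the hardware generator presented in this brief: the function names, the noise level, the seed, and the sample count are all hypothetical choices. With a zero-mean gain and unit variance per quadrature component, the sample mean of the envelope should approach the Rayleigh mean, sqrt(pi/2).

```python
import math
import random


def fading_gain(rng, mean=0j, sigma=1.0):
    """Complex Gaussian gain with independent quadrature components."""
    return complex(rng.gauss(mean.real, sigma), rng.gauss(mean.imag, sigma))


def received_sample(s, rng, noise_sigma=0.1, mean=0j, sigma=1.0):
    """One sample of y = g*s + n for a transmitted complex symbol s."""
    g = fading_gain(rng, mean, sigma)
    n = complex(rng.gauss(0.0, noise_sigma), rng.gauss(0.0, noise_sigma))
    return g * s + n


rng = random.Random(1)  # fixed seed so the run is repeatable

# Zero-mean gain: the envelope |g| is Rayleigh distributed.
envelopes = [abs(fading_gain(rng)) for _ in range(20000)]
mean_envelope = sum(envelopes) / len(envelopes)
expected_mean = math.sqrt(math.pi / 2)

y = received_sample(1 + 0j, rng)
print(f"sample mean of |g|: {mean_envelope:.3f}, Rayleigh mean: {expected_mean:.3f}")
```

Passing a nonzero `mean` to `fading_gain` gives the Ricean case described in the text.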
The radio channel is usually the key factor that limits the performance of a wireless communication system. System performance is commonly characterized through the symbol error rate (SER) versus signal-to-noise ratio (SNR) relationship and this is typically measured experimentally using Monte Carlo (MC) simulations on workstations. Wireless communication systems are increasingly complex and the number of possible operating modes that must be verified has increased dramatically. As the number of possible operating modes increases (e.g., more than 300 modulation and coding schemes are present in the IEEE 802.11n standard), the bit-true fixed-point MC simulation times become a bottleneck to timely product design and verification.
Hardware-based simulation of digital communication systems offers significant speedups compared to software simulations, with no significant loss in accuracy [3], [4]. Recently, several hardware implementations of Gaussian variate generators have been proposed (see [5], [6], and their references). However, hardware implementations of other important distributions, such as the Rayleigh and Ricean distributions, have received far less attention [7], [8]. This brief extends our earlier work on designing Gaussian variate generators (GVGs) [6]. We now present high-throughput and compact Rayleigh and Ricean variate generators that are suitable for implementation on a field-programmable gate array (FPGA). We utilize the Box-Muller (BM) algorithm [9] to efficiently implement a Rayleigh variate generator. This generator is then enhanced to generate variates with the Ricean distribution. The Ricean variate generator can in turn be used to generate variates for two other important distributions: the Gamma distribution and the Chi-squared distribution with two degrees-of-freedom. The Gamma and Chi-squared distributions have been used to model interference in wireless communication systems [10].
The sequel is organized as follows. Section II presents our FPGA implementation of the Rayleigh variate generator. Section III presents the new Ricean variate generator. The implementation costs and simulation results are presented in Section IV. Concluding remarks appear in Section V.
II. RAYLEIGH VARIATE GENERATOR
Let $n_i$ and $n_q$ be two independent normally-distributed variates with zero means and equal variance $\sigma^2$. The variable defining the magnitude $r = \sqrt{n_i^2 + n_q^2}$ has a Rayleigh distribution with mean $\sigma\sqrt{\pi/2}$ and variance $(4 - \pi)\sigma^2/2$ [11]. To implement a Rayleigh variate generator, instead of generating two independent Gaussian variables, $n_i$ and $n_q$, and then computing the magnitude of the complex Gaussian-distributed variate $n = n_i + j n_q$, where $j^2 = -1$, we use the well-known BM algorithm. According to this algorithm, if $u_1$ and $u_2$ are two independent uniformly-distributed pseudorandom numbers (PNs) in the interval (0, 1) and $f(u_1) = \sqrt{-2\ln(u_1)}$, then $n_1 = f(u_1) \times \sin(2\pi u_2)$ and $n_2 = f(u_1) \times \cos(2\pi u_2)$ are two independent variates from a zero-mean, unit-variance Gaussian distribution $N(0, 1)$. Therefore the variate $r = \sqrt{n_1^2 + n_2^2} = f(u_1)$ follows the Rayleigh distribution.
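The Box-Muller shortcut described above can be checked with a short floating-point sketch. It is only an illustration of the identity r = f(u_1), using Python's standard library; the FPGA implementation is the subject of the remainder of this brief and is not modeled here. The helper names, the seed, and the sample count are hypothetical.

```python
import math
import random


def f(u1):
    """Box-Muller magnitude function f(u1) = sqrt(-2 ln u1), for u1 in (0, 1)."""
    return math.sqrt(-2.0 * math.log(u1))


def box_muller(u1, u2):
    """Two independent N(0, 1) variates from two uniform variates in (0, 1)."""
    r = f(u1)
    return r * math.sin(2 * math.pi * u2), r * math.cos(2 * math.pi * u2)


def uniform_open(rng):
    """Uniform variate in the open interval (0, 1); rejects an exact 0.0."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


rng = random.Random(2024)  # fixed seed so the run is repeatable

# The magnitude of the Box-Muller pair equals f(u1), so u2 is not needed
# when only the Rayleigh variate is wanted.
u1, u2 = uniform_open(rng), uniform_open(rng)
n1, n2 = box_muller(u1, u2)
magnitude = math.hypot(n1, n2)

samples = [f(uniform_open(rng)) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
rayleigh_mean = math.sqrt(math.pi / 2)  # mean for unit variance (sigma = 1)

print(f"|n1 + j n2| = {magnitude:.6f}, f(u1) = {f(u1):.6f}")
print(f"sample mean = {sample_mean:.3f}, Rayleigh mean = {rayleigh_mean:.3f}")
```

Generating the Rayleigh variate directly from f(u_1) avoids both trigonometric evaluations and the square root of a sum of squares, which is the saving the text points to.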
