Abstract-In computer architecture research and design, simulators are a viable way for researchers to evaluate the functionality and performance of future computer system architectures. Simulators can greatly reduce the cost and time needed to complete a project, because they allow quick quantitative evaluation of a wide range of architectural components, such as the CPU, memory and I/O system. The M5 simulator is one of the open source computer architecture simulators available. This paper proposes a graphics subsystem for the existing M5 simulator to enable research and development activities involving the Graphics Processing Unit (GPU). The GPU model chosen for the M5 simulator is the PCI-based Voodoo 3 GPU, manufactured by 3dfx Interactive Inc. The functionality of the GPU model is evaluated using several software testing methodologies, and the programmability of the model is tested using the Glide API, the 3D graphics API developed for the Voodoo Graphics 3D accelerator cards.
I. INTRODUCTION
The M5 simulator is one of the open source computer architecture simulators available. It is a modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture [1].
M5 was initially developed to enable research in TCP/IP networking, because the lack of simulation tools limited architects' ability to explore new designs for network I/O. For that purpose it includes a model of the National Semiconductor DP83820 network interface controller [2].
The purpose of this project is to enable research and development activities involving the GPU on the existing M5 simulator. The GPU has become an integral part of today's mainstream computer systems as well as embedded systems, including gaming consoles and smartphones. Over the last decade the GPU has evolved from a powerful graphics engine into a highly parallel programmable processor that can also run non-graphics applications. In addition, there is a lack of computer architecture simulation tools that include a GPU model.
The GPU model built for the M5 simulator is the PCI-based Voodoo 3 GPU manufactured by 3dfx Interactive Inc. This GPU was selected because its datasheet is widely available and its architecture is simple.
This project models the complete 2D and 3D core functions of the GPU together with its configuration and status registers, which can be accessed by the device driver through programmed I/O, with enough fidelity to support the unmodified Linux device driver. The VGA core, which is responsible for producing graphics on the monitor, is not implemented in this work. The functionality of the GPU core built for the M5 simulator is verified through several kinds of software testing, such as unit testing and functional testing, and the programmability of the model is tested using Glide, a 3D graphics API, with Linux kernel version 2.6.27.6 on the simulated M5 Alpha architecture. This paper is organized as follows: Section 2 presents the system overview of the GPU, the GPU architecture and the M5 computer architecture simulator core. Section 3 presents the results and discussion of this project, and Section 4 presents the conclusion and recommendations.
II. SYSTEM OVERVIEW

A. M5 Computer Architecture Simulator Core
M5 is a modular platform for computer system architecture research, developed at the University of Michigan, encompassing system-level architecture as well as processor microarchitecture. Key features of M5 include object orientation, multiple interchangeable CPU models, an event-driven memory system, multiple ISA support and full-system capability. The supported ISAs include Alpha and SPARC; MIPS, ARM and x86 support is in progress.
The Object Oriented (OO) techniques used in M5 provide a clear interface between each simulation object and the rest of the system. The benefits are threefold. First, researchers can modify a component's behavior using only localized code changes, with less risk of breaking seemingly unrelated parts of the simulator. Second, OO design encourages realistic modeling, because violations of software modularity often indicate violations of hardware modularity. Third, different models for a particular component, such as a CPU, can be substituted easily within a particular configuration.
M5 is implemented using two OO programming languages: Python is used for high-level object configuration and simulation scripting, and C++ is used for low-level object implementation where performance is important. All simulation objects available in M5 (CPUs, busses, caches and so on) are represented as objects in both Python and C++. Using Python objects for configuration allows flexible script-based object composition to describe complex simulation targets. Once the configuration is constructed in Python, M5 instantiates the corresponding C++ objects. M5 includes a variety of object models implemented on top of the core simulation engine. These models include CPUs, caches, busses, and I/O devices.
1) CPU Models
M5 contains two primary CPU models: SimpleCPU and O3CPU. Both models derive from a base CPU class and export the same interface, allowing them to be used interchangeably. M5 can also switch between CPU models at run time, allowing the use of SimpleCPU for fast-forwarding and warm-up phases and O3CPU for taking statistics.
SimpleCPU is an in-order, non-pipelined functional model that can be configured to execute one or more instructions per cycle, but can have only one outstanding memory operation. O3CPU is an out-of-order, superscalar, pipelined, simultaneous multithreading (SMT) model.
2) Memory System
The memory system within M5 is divided into two main types of objects: devices and interconnects. Components such as caches, memory and I/O devices are devices, while interconnects are communication mechanisms such as busses and networks.
M5 supports configurable caches with parameters for size, latency, associativity, replacement policy and coherence protocol. Bus objects model a split-transaction bus which is configurable in both latency and bandwidth. A simple bus bridge object is available to connect busses of different speeds, e.g. the PCI bus and system bus.
3) I/O Devices
I/O devices are one class of object participating in the memory hierarchy; they respond to programmed I/O accesses at configurable address ranges and issue DMA transactions. These I/O memory accesses are indistinguishable from CPU accesses in the memory hierarchy, and their timing is modeled with equal fidelity. In addition to regular I/O devices, such as serial ports and disk controllers, M5 also models system chipsets (e.g. memory and interrupt controllers) with sufficient fidelity to boot unmodified OS kernels.
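The idea of a device answering programmed I/O at a configurable address range can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `PioDevice` and `SimpleBus`, their methods and the example address range are hypothetical and are not part of the M5 code base, and timing and DMA are not modeled.

```python
class PioDevice:
    """Hypothetical device that claims one address range and holds 32-bit registers."""

    def __init__(self, base, size):
        self.base = base      # first address claimed by the device
        self.size = size      # number of bytes claimed
        self.regs = {}        # offset within the range -> stored value

    def claims(self, addr):
        return self.base <= addr < self.base + self.size

    def read(self, addr):
        # Registers that were never written read back as zero in this sketch.
        return self.regs.get(addr - self.base, 0)

    def write(self, addr, value):
        self.regs[addr - self.base] = value & 0xFFFFFFFF


class SimpleBus:
    """Hypothetical bus that forwards each access to the device claiming the address."""

    def __init__(self):
        self.devices = []

    def attach(self, device):
        self.devices.append(device)

    def _find(self, addr):
        for dev in self.devices:
            if dev.claims(addr):
                return dev
        raise ValueError(f"no device claims address {addr:#x}")

    def read(self, addr):
        return self._find(addr).read(addr)

    def write(self, addr, value):
        self._find(addr).write(addr, value)


bus = SimpleBus()
dev = PioDevice(base=0x10000000, size=0x1000)  # arbitrary example range
bus.attach(dev)
bus.write(0x10000004, 0x1234)
readback = bus.read(0x10000004)
```

From the CPU's point of view the access is an ordinary load or store; only the bus lookup decides which object services it.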
B. Overview of Graphics Processing Unit (GPU)
August 31, 1999 marks the introduction of the GPU to the PC industry. In general, a GPU, also called a Video Processing Unit (VPU), is a dedicated graphics rendering device for a personal computer, workstation, or gaming console [3].
Technically, the definition of a GPU is a single chip processor with integrated transform, lighting and rendering engine that is capable of processing a minimum of 10 million polygons per second [4] .
A GPU can take one of two forms, dedicated or integrated. A dedicated graphics card typically interfaces with the PC motherboard by means of an expansion slot such as PCIe or AGP [5], and has dedicated RAM for its own use. An integrated GPU is built onto the motherboard of the system. GPU manufacturers in the industry include ATI Technologies, NVIDIA Corporation, 3Dlabs, Matrox, XGI Technology Inc., Intel and 3dfx [6].
In general, the main graphics engine on the GPU is the graphics pipeline. The graphics pipeline is a conceptual model of the stages that graphics data passes through; it converts coordinates from a form that is easier for the application programmer into a form that is more convenient for the display hardware. Figure 1 shows a simplified three-stage graphics pipeline in a GPU. The first stage of the graphics pipeline is the application stage. In this stage, the application program running on the CPU generates the 3D triangle coordinates and feeds commands to the graphics subsystem.
The second stage of the graphics pipeline is the geometry processing stage. This stage takes as input models or shapes made up of triangles with 3D coordinates (x, y, z) and converts them into triangles with 2D coordinates (x, y). This stage is also referred to as the "Transform and Lighting" stage. The transformation converts the 3D data from one frame of reference to a new frame of reference, and the lighting adds effects that enhance the realism of a scene and bring the rendered images one step closer to our perception of the real world [6].
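One common way to turn 3D triangle coordinates into 2D coordinates is a perspective projection, sketched below. The paper does not state which transform is used, so the choice of projection, the function names and the `focal_length` parameter are assumptions made only for illustration.

```python
def project_vertex(x, y, z, focal_length=1.0):
    """Perspective-project a camera-space point onto the plane z = focal_length."""
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


def project_triangle(triangle, focal_length=1.0):
    """Project each (x, y, z) vertex of a triangle to an (x, y) pair."""
    return [project_vertex(x, y, z, focal_length) for (x, y, z) in triangle]


triangle_3d = [(0.0, 1.0, 2.0), (-1.0, -1.0, 4.0), (1.0, -1.0, 4.0)]
triangle_2d = project_triangle(triangle_3d)
```

Vertices farther from the camera (larger z) land closer to the center of the image, which is what produces the impression of depth.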
The third stage of the graphics pipeline is the rendering stage, which fills the area enclosed by the 2D coordinates with pixels to represent the surface of the object. In detail, rendering is the process of calculating the correct color for each pixel on the screen, given all the information delivered by the setup engine. The rendering engine must consider the color of the object, the color and direction of the light hitting the object, whether the object is translucent, and what textures apply to the object [6]. This stage is also referred to as the rasterization stage of the graphics pipeline [7].
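The pixel-filling step can be illustrated with a small software rasterizer. The sketch below uses edge functions and barycentric color interpolation, a standard textbook technique chosen here for illustration; it is not taken from the Voodoo 3 hardware or from the model described in this paper, and lighting, translucency and texturing are left out.

```python
def edge(ax, ay, bx, by, px, py):
    """Twice the signed area of triangle (A, B, P); the sign tells which side of A->B holds P."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(v0, v1, v2, width, height):
    """Return {(x, y): (r, g, b)} for every pixel whose center lies inside the triangle.

    Each vertex is (x, y, (r, g, b)). Colors are interpolated with barycentric weights.
    """
    (x0, y0, c0), (x1, y1, c1), (x2, y2, c2) = v0, v1, v2
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return {}  # degenerate triangle covers no pixels
    pixels = {}
    for py in range(height):
        for px in range(width):
            cx, cy = px + 0.5, py + 0.5  # sample at the pixel center
            # Dividing by the full area yields weights that sum to 1
            # and makes the test independent of vertex winding order.
            w0 = edge(x1, y1, x2, y2, cx, cy) / area
            w1 = edge(x2, y2, x0, y0, cx, cy) / area
            w2 = edge(x0, y0, x1, y1, cx, cy) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                pixels[(px, py)] = tuple(
                    round(w0 * a + w1 * b + w2 * c)
                    for a, b, c in zip(c0, c1, c2)
                )
    return pixels


red, green, blue = (255, 0, 0), (0, 255, 0), (0, 0, 255)
frame = rasterize((0, 0, red), (8, 0, green), (0, 8, blue), width=8, height=8)
```

A pixel near a vertex takes mostly that vertex's color, and the colors blend smoothly across the interior of the triangle.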
In the early evolution of the GPU, its main purpose was to handle the complex floating-point arithmetic needed to calculate 3D geometry and vertices, and then to apply the results to pixel lighting and color values. Early GPUs contained a configurable 32-bit floating-point vertex transform and lighting processor and a configurable integer pixel-fragment pipeline, which could be programmed with the OpenGL and Microsoft DirectX APIs [8].
GPUs are becoming more flexible and programmable. They are no longer built only to handle graphics computations; their programmable capability enables more general computations, known as General Purpose computing on Graphics Processing Units (GPGPU). Researchers now turn to the GPU for some general purpose computations.
Figure 2: Floating point operations on CPU and GPU [9]
Compared with the CPU, the GPU has higher computation power. From Figure 2, it can be concluded that the GPU performs more floating point operations per second than the CPU, owing to the difference in their internal architectures. Another reason for the adoption of GPGPU is the progress of the software platform. In the early days of GPGPU, researchers had to write assembly instructions to carry out computation on the GPU, but the introduction of Cg and the OpenGL Shading Language made it easier to write code for the hardware. This makes it more convenient for non-graphics programmers to take advantage of the power of the GPU in their computation applications [9].
C. GPU architecture
The GPU architecture used as the reference for modeling a GPU plugin for the M5 simulator is the Voodoo 3 GPU [10], manufactured by 3dfx Interactive Inc. Figure 3 shows the architecture of the Voodoo 3 GPU. As Figure 3 shows, the Voodoo 3 GPU has two cores, the 2D core and the 3D core; the 3D core is formed by the FBI and TMU blocks.
The 2D block has 31 registers in total, and almost all of them are Read/Write except for the status register and the launchArea register, which are Read-Only. Reading a 2D register always returns the value that would be used if a new operation were begun without writing a new value to that register. This value is either the last value written to the register or, if an operation has been performed since the value was written, the value after all operations have completed.
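The read and write behavior described above can be sketched as a small register file. This is a minimal illustration rather than the model's actual code: the class `RegisterFile`, its methods and the register name `dstXY` are placeholders, and only `status` is a register name taken from the text.

```python
class RegisterFile:
    """Hypothetical register file with software-visible read-only registers."""

    def __init__(self, names, read_only=()):
        self.values = {name: 0 for name in names}
        self.read_only = set(read_only)

    def write(self, name, value):
        """Driver write; ignored for read-only registers. Returns True if stored."""
        if name in self.read_only:
            return False
        self.values[name] = value & 0xFFFFFFFF
        return True

    def read(self, name):
        """Driver read; returns the value a new operation would start from."""
        return self.values[name]

    def hw_update(self, name, value):
        """Update made by the device model itself, e.g. when an operation completes."""
        self.values[name] = value & 0xFFFFFFFF


regs = RegisterFile(["status", "dstXY"], read_only=["status"])
regs.write("dstXY", 0x00100020)
before_op = regs.read("dstXY")          # last value written by software
regs.hw_update("dstXY", 0x00100040)     # an operation advanced the register
after_op = regs.read("dstXY")           # value after the operation completed
status_write_ok = regs.write("status", 0xFFFFFFFF)
```

Keeping software writes and device-side updates in separate methods makes the read-only rule easy to enforce without preventing the model from changing its own status.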
The 3D block of the Voodoo 3 has 255 registers in total, and almost all of them are fully writable except the status register, which is Read-Only. The 3D register set is accessed through a 22-bit FBI memory-mapped register address (4 Mbyte), which is divided into the following fields:

Field          AltMap  Swizzle  Wrap  Chip  Register  Byte
Width (bits)   1       1        6     4     8         2

Both 2D and 3D registers are memory mapped. 2D registers have the offset 0x0100000 and 3D registers have the offset 0x0200000 at Memory Base0. Figure 4 and Figure 5 show the detailed block diagrams of the 2D block and the FBI/TMU block, respectively.
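Splitting a 22-bit register address into the fields listed above can be sketched as follows. The field widths come from the text; the bit ordering (AltMap in the most significant bit down to Byte in the two least significant bits) and the function name `decode_fbi_address` are assumptions made for this example and should be checked against the datasheet.

```python
# (name, width in bits), assumed to run from most to least significant bit
FBI_FIELDS = [
    ("altmap", 1),
    ("swizzle", 1),
    ("wrap", 6),
    ("chip", 4),
    ("register", 8),
    ("byte", 2),
]
ADDRESS_BITS = sum(width for _, width in FBI_FIELDS)  # 22


def decode_fbi_address(addr):
    """Split a 22-bit FBI register address into its named fields."""
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 22 bits")
    fields = {}
    shift = ADDRESS_BITS
    for name, width in FBI_FIELDS:
        shift -= width
        fields[name] = (addr >> shift) & ((1 << width) - 1)
    return fields


# Example address: AltMap set, chip field 5, register index 0x42, byte lane 1
example = (1 << 21) | (0x5 << 10) | (0x42 << 2) | 0x1
decoded = decode_fbi_address(example)
```

Driving the decode from a table of widths keeps the layout in one place, so a different field order only requires reordering `FBI_FIELDS`.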
III. RESULTS AND DISCUSSIONS
A. Linux Kernel with GPU Driver
The purpose of the GPU device driver is to drive, manage, control, direct and monitor the hardware under its command. The Linux kernel used in this project is version 2.6.27.6, built with the Direct Rendering Manager (DRM) driver for the Voodoo 3 graphics card. When the Linux kernel is compiled, the file vmlinux, which is the statically linked executable that contains the Linux kernel, is placed in the directory that the M5 simulator points to.
Figure 6 and Figure 7 show screenshots of the M5 simulator booting the compiled Linux kernel. Figure 6 shows the M5 simulator booting Linux kernel version 2.6.27.6, cross-compiled with crosstool-NG, and Figure 7 shows that the kernel booted by the M5 simulator includes the unmodified Linux driver for the Voodoo 3 graphics card. Figure 7 also shows that the M5 simulator is able to detect the plugged-in GPU device, initialize it and allocate it as a PCI device assigned to Bus 0, Device 2, Function 0, with the latency timer of the device set to 64.
B. Code Implementation
The GPU model is developed and implemented using two OO languages, Python and C++. Because the M5 simulator does not have a graphics subsystem, a new graphics subsystem, the graphic device module, is written in Python. Only after this new module is created is the GPU model, written in C++, plugged into the system. Table 1 presents the source code files created and modified in this project and inserted into the M5 simulator directory, and the description of each file is summarized in Table 2. Figure 8 depicts the object model in the M5 simulator for a PCI device together with the GPU model developed in this project.
Since the graphic device is a PCI device, it inherits from the PCIDev class of the M5 simulator. As the GPU model developed in this project is a graphic device, the new object, TDFXGPU, which describes the device parameters and the PCI configuration of the device, inherits from the GraphicDevice class. The arrows indicate that an object uses the attributes or services provided by another object. All of these object models in the M5 simulator are implemented in Python.
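The inheritance chain described above can be sketched in plain Python. The classes below are stand-ins that only mirror the names PCIDev, GraphicDevice and TDFXGPU; they do not use the M5 parameter system and are not the project's source files. The attribute names are hypothetical, and the PCI vendor and device IDs shown are the values commonly listed for 3dfx and Voodoo 3, which should be checked against the datasheet. The bus, device, function and latency timer values are the ones reported in Section III.A.

```python
class PCIDev:
    """Stand-in for the simulator's PCI device base class (not the real M5 class)."""

    vendor_id = 0x0000
    device_id = 0x0000
    latency_timer = 0

    def __init__(self, bus, device, function):
        self.bus = bus
        self.device = device
        self.function = function

    def location(self):
        """PCI location formatted as bus:device.function."""
        return f"{self.bus:02x}:{self.device:02x}.{self.function}"


class GraphicDevice(PCIDev):
    """Stand-in for the graphic device module; marks the device as a display controller."""

    class_code = 0x03  # PCI base class code for display controllers


class TDFXGPU(GraphicDevice):
    """Stand-in for the Voodoo 3 model's configuration object."""

    vendor_id = 0x121A     # assumed: 3dfx Interactive vendor ID
    device_id = 0x0005     # assumed: Voodoo 3 device ID
    latency_timer = 64     # value reported at boot in Section III.A


gpu = TDFXGPU(bus=0, device=2, function=0)
where = gpu.location()
```

Putting the PCI identity in class attributes lets each new device model override only what differs from its parent.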
C. Full System Testing
Full system testing of the GPU model built for the M5 simulator was carried out with four Glide API programs as test data, to ensure that the model is fully operational and reliable. The purpose of the full system testing is to test the programmability of the GPU model, and all the features and functions of the model are exercised across the four system tests. The procedure followed and the outcome of each system test are discussed below. Since the current GPU model does not have a VGA module, the output of each system test is written to a text file. The text file produced by each test case is used to evaluate whether the test passed or failed. In summary, all four system tests conducted in this project passed.
The first test case in the full system testing tests the triangle vertex setup performed by the GPU model. The coordinates (x-axis and y-axis) and color values (RGB) of three triangle vertices, vertex A, vertex B and vertex C, are set up by the GPU using the Glide API. The third test case in the full system testing tests the point drawing performed by the GPU model. The triangle vertex coordinates (x-axis and y-axis) are set up by the GPU using the Glide API.
Test output excerpt: Glide is shutdown: TRUE
The fourth test case in the full system testing tests the triangle vertex setup and the arithmetic function performed by the GPU model. The coordinates (x-axis and y-axis) of two triangle vertices, vertex A and vertex B, are set up by the GPU using the Glide API.
Test output excerpt: Glide is shutdown: TRUE
IV. CONCLUSION
In this project, a GPU model was built and integrated with the existing M5 computer architecture simulator. The GPU model can be booted with an unmodified Linux kernel, version 2.6.27.6, and its driver on an Alpha architecture. The functionality of the GPU model was evaluated using several software testing techniques, and the programmability of the model was tested using the Glide API.
