Abstract-This paper presents the design and development of a high-performance, low-power MIPS microprocessor and its implementation on an FPGA. To achieve high performance and low power, the proposed microprocessor combines several methods: unfolding transformation (parallel processing), the C-slow retiming technique, and double-edge-triggered registers, which are used to reduce power consumption further. The other blocks are designed using high-speed digital circuits. Because the proposed architecture contains feedback loops, it is a suitable target for C-slow retiming, which can enhance designs that contain feedback loops. C-slow retiming is a well-known optimization and high-performance technique; it automatically rebalances the registers in the proposed design. The proposed microprocessor is modeled and verified using an FPGA and simulation results. The proposed methods have been synthesized and implemented in Quartus II 9.1 for a Stratix II FPGA (target device EP2S15F484C3), and power is analyzed with the XPower analyzer. The results demonstrate that the proposed method achieves high performance.
I. INTRODUCTION
Microcontrollers and microprocessors are used in everyday systems. Basically, any electronic system that requires computation or instruction execution requires a microcontroller or microprocessor; a microprocessor is therefore at the core of every electronic system with computational capability. Microprocessors have grown from 8 bits to 16 bits, 32 bits, and currently to 64 bits. Microprocessor architecture has also evolved, from complex instruction set computing (CISC) to reduced instruction set computing (RISC), then to combined RISC-CISC designs, and currently to very long instruction word (VLIW) designs [1]. Some microprocessors are optimized for high-performance servers, whereas others are optimized for long battery life in laptop computers. A computer architecture is defined by its instruction set and architectural state. The architectural state of the MIPS processor consists of the program counter and the 32 registers. Any MIPS microarchitecture must contain all of this state. Based on the current architectural state, the processor executes a particular instruction with a particular set of data to produce a new architectural state. Some microarchitectures contain additional nonarchitectural state to either simplify the logic or improve performance [2]. The IBM System z10™ microprocessor is currently the fastest running 64-bit CISC microprocessor. It operates at 4.4 GHz and provides up to two times the performance of its predecessor, the System z9® microprocessor. In addition to its ultrahigh-frequency pipeline, the z10™ microprocessor offers performance enhancements such as a sophisticated branch-prediction structure, a large second-level private cache, a data-prefetch engine, and a hardwired decimal floating-point arithmetic unit. The z10 microprocessor also implements new architectural features that allow better software optimization across compiled applications.
These features include new instructions that help shorten the code path lengths and new facilities for software-directed cache management and the use of 1-MB virtual pages [3].
In recent years, a number of studies on implementing microprocessors on FPGAs have been reported [4] [5] [6] [7] [8] [9] [10] [11]. Some of them are summarized here. The Aviation Microelectronic Center of NPU (Northwestern Polytechnical University) has recently completed the development of a 32-bit superscalar RISC microprocessor called "Longtium" R2; its architecture is presented in [4]. As explained in [5], the PowerPC 603e microprocessor is a high-performance, low-cost, low-power microprocessor designed for use in portable computers. The 603e is an enhanced version of the PowerPC 603 microprocessor and extends the performance range of the PowerPC microprocessor family of portable products. The enhancements include increasing the frequency to 100 MHz, doubling the on-chip instruction and data caches to 16 KB each, increasing the cache associativity to 4-way set-associative, adding an extra integer unit, and increasing the throughput of stores and misaligned accesses. Three new bus modes are added to allow more flexibility in system design. The estimated performance of the 603e at 100 MHz is 120 SPECint92 and 105 SPECfp92. In [6], a new microprocessor design framework called DOTTA (dynamic operation transport triggered array) is introduced, together with an FPGA (field programmable gate array) implementation. The aim of this processor framework is to eliminate bottlenecks introduced by traditional microprocessor architectures. The framework defines a task-specific microprocessor that can be customized to the application on the target system on which it operates; this has been achieved by using Xilinx FPGA fabric.
In [7], the development of an 8-bit CISC microprocessor core is presented; it is intended as an open core for teaching applications in the digital systems laboratory. The core is fully open, so the user has access to all internal signals and can change the structure itself, which is very useful when teaching microprocessor design. The main advantages of this core over commercially available equivalent cores are that it is not vendor sensitive, allowing implementation in almost any FPGA family, and that it is an open core that can be fully monitored and modified to fit specific design constraints. Several tests were performed on the microprocessor core, including an embedded microcontroller with RAM, ROM, and I/O capabilities. The development includes a meta-assembler and linker to embed user programs in a ROM, which is automatically generated as a VHDL description. In [8], a new methodology based on practical sessions with real devices and chips is proposed. Simple microprocessor designs are presented to the students at the beginning, and the complexity is raised gradually toward a final design with a multiprocessor integrated in a single FPGA chip. In [9], the implementation and delivery of a microprocessor-based design laboratory is presented, in an attempt to achieve tighter integration with theory and improve student performance. The design process follows a hierarchical structure, requiring students to first build basic devices such as logic gates, multiplexers, and one-bit memory cells. These basic devices are then used to build an ALU, registers (which are in turn used to build larger memories), a datapath, and a control unit. Designs are completed without any high-level programming, ensuring that students cannot rely on the compiler to transform specifications into implementations.
In [10], several case studies examine the effects of various embedded processor memory strategies and peripheral sets. By comparing the benchmark system to a real-world system, the study examines techniques for optimizing the performance and cost of an FPGA embedded processor system. The development of a microprocessor-based automatic gate is presented in [11]. The inconvenience encountered in gate operations has called for an extensive search for solutions. The microprocessor-based automatic gate offers everything necessary to put an end to these inconveniences, as it incorporates an intelligent device (a microprocessor). The automatic gate developed in that project is unique in that it is controlled by software, which can be modified any time the system demands a change. The main goal of this paper is to design a high-performance microprocessor on FPGA in a novel way. We apply several methods to the microarchitecture, and the performance and effectiveness of the proposed methods, in terms of throughput rate and hardware cost, are reported for the proposed structure. This paper is organized as follows. An overview of the MIPS processor is given in Section II. Section III describes the proposed methods. Section IV compares performance, power consumption, and chip utilization to verify the proposed work. Section V concludes the paper.
II. MIPS PROCESSOR
MIPS is a 32-bit architecture, so it has a 32-bit datapath. The control unit receives the current instruction from the datapath and tells the datapath how to execute that instruction. Specifically, the control unit produces multiplexer select, register enable, and memory write signals to control the operation of the datapath. The program counter is an ordinary 32-bit register. Its output, PC, points to the current instruction. Its input, PC', indicates the address of the next instruction. The instruction memory has a single read port. It takes a 32-bit instruction address input, A, and reads the 32-bit data (i.e., instruction) from that address onto the read data output, RD. The 32-element × 32-bit register file has two read ports and one write port. The read ports take 5-bit address inputs, A1 and A2, each specifying one of 2^5 = 32 registers as source operands. They read the 32-bit register values onto read data outputs RD1 and RD2, respectively. The write port takes a 5-bit address input, A3; a 32-bit write data input, WD; a write enable input, WE3; and a clock. If the write enable is 1, the register file writes the data into the specified register on the rising edge of the clock. The data memory has a single read/write port. If the write enable, WE, is 1, it writes data WD into address A on the rising edge of the clock. If the write enable is 0, it reads address A onto RD. The instruction memory, register file, and data memory are all read combinationally. In other words, if the address changes, the new data appears at RD after some propagation delay; no clock is involved. They are written only on the rising edge of the clock. In this fashion, the state of the system is changed only at the clock edge. The address, data, and write enable must set up some time before the clock edge and must remain stable until a hold time after the clock edge.
Because the state elements change their state only on the rising edge of the clock, they are synchronous sequential circuits. The microprocessor is built of clocked state elements and combinational logic, so it too is a synchronous sequential circuit. Indeed, the processor can be viewed as a giant finite state machine, or as a collection of simpler interacting state machines [2]. To keep the microarchitectures easy to understand, we consider only a subset of the MIPS instruction set: the memory instructions lw and sw, the R-type instructions, and the branch instruction beq.
A. MIPS Microarchitectures
This part, based on [2], describes a MIPS microarchitecture that executes instructions in a single cycle. The first step is to read the instruction from instruction memory. The PC is simply connected to the address input of the instruction memory. The instruction memory reads out, or fetches, the 32-bit instruction, labeled Instr. The processor's actions depend on the specific instruction that was fetched. For a lw instruction, the next step is to read the source register containing the base address. This register is specified in the rs field of the instruction, Instr[25:21]. These bits of the instruction are connected to the address input of one of the register file read ports, A1. The register file reads the register value onto RD1.
The lw instruction also requires an offset. The offset is stored in the immediate field of the instruction, Instr[15:0]. Because the 16-bit immediate might be either positive or negative, it must be sign-extended to 32 bits. The 32-bit sign-extended value is called SignImm. Sign extension simply copies the sign bit of a short input into all of the upper bits of the longer output. Specifically, SignImm[15:0] = Instr[15:0], and every bit of SignImm[31:16] equals Instr[15]. The processor must add the base address to the offset to find the address to read from memory. An ALU is introduced to perform this addition. The ALU receives two operands, SrcA and SrcB. SrcA comes from the register file, and SrcB comes from the sign-extended immediate. The ALU can perform many operations. The 3-bit ALUControl signal specifies the operation. The ALU generates a 32-bit ALUResult and a Zero flag that indicates whether ALUResult == 0. For a lw instruction, the ALUControl signal should be set to 010 to add the base address and offset. ALUResult is sent to the data memory as the address for the load instruction, as shown in Fig.1. The data is read from the data memory onto the ReadData bus, then written back to the destination register in the register file at the end of the cycle. Port 3 of the register file is the write port. The destination register for the lw instruction is specified in the rt field, Instr[20:16], which is connected to the port 3 address input, A3, of the register file. The ReadData bus is connected to the port 3 write data input, WD3, of the register file. A control signal called RegWrite is connected to the port 3 write enable input, WE3, and is asserted during a lw instruction so that the data value is written into the register file. While the instruction is being executed, the processor must compute the address of the next instruction, PC'. Because instructions are 32 bits = 4 bytes, the next instruction is at PC + 4. Another adder is used to increment the PC by 4.
The new address is written into the program counter on the next rising edge of the clock. This completes the datapath for the lw instruction. Like the lw instruction, the sw instruction reads a base address from port 1 of the register file and sign-extends an immediate. The ALU adds the base address to the immediate to find the memory address. The sw instruction also reads a second register, whose value is written to the data memory. This register is specified in the rt field, Instr[20:16]. These bits of the instruction are connected to the second register file read port, A2. The register value is read onto the RD2 port, which is connected to the write data port of the data memory. The write enable port of the data memory, WE, is controlled by MemWrite. For a sw instruction, MemWrite = 1, to write the data to memory; ALUControl = 010, to add the base address and offset; and RegWrite = 0, because nothing should be written to the register file. The R-type instructions read two registers from the register file and write the result back to a third register. They differ only in the specific ALU operation, using different ALUControl signals.
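The address computation shared by lw and sw can be illustrated with a short Python sketch. It is only a behavioral illustration of the field extraction, sign extension, and ALU addition described above, not the VHDL of the proposed design; the function names `fields`, `sign_extend16`, and `mem_address` are hypothetical.

```python
def sign_extend16(imm16: int) -> int:
    """Sign-extend a 16-bit immediate to 32 bits (returned as an unsigned value)."""
    imm16 &= 0xFFFF
    # Copy bit 15 into bits 31:16 when the immediate is negative.
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16


def fields(instr: int) -> dict:
    """Split a 32-bit I-type instruction into the fields used by lw and sw."""
    return {
        "rs": (instr >> 21) & 0x1F,   # Instr[25:21], base register (port A1)
        "rt": (instr >> 16) & 0x1F,   # Instr[20:16], lw destination / sw source
        "imm": instr & 0xFFFF,        # Instr[15:0], offset
    }


def mem_address(base: int, imm16: int) -> int:
    """ALU addition (ALUControl = 010): base + SignImm, wrapped to 32 bits."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF
```

For example, with a base address of 0x1000 and an immediate of 0xFFFC (that is, -4), `mem_address` yields 0x0FFC.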
The register file reads two registers. The ALU performs an operation on these two registers. In Fig.1, the ALU always receives its SrcB operand from the sign-extended immediate (SignImm). For R-type instructions, a multiplexer is added to choose SrcB from either the register file RD2 port or SignImm. The multiplexer is controlled by a new signal, ALUSrc. ALUSrc is 0 for R-type instructions to choose SrcB from the register file; it is 1 for lw and sw to choose SignImm. So far, the register file always gets its write data from the data memory. However, R-type instructions write the ALUResult to the register file. Therefore, another multiplexer is needed to choose between ReadData and ALUResult. Its output is called Result. This multiplexer is controlled by another new signal, MemtoReg. MemtoReg is 0 for R-type instructions to choose Result from the ALUResult; it is 1 for lw to choose ReadData. For lw, the register to write is specified by the rt field of the instruction, Instr[20:16]. However, for R-type instructions, the register is specified by the rd field, Instr[15:11]. Thus, a third multiplexer is added to choose WriteReg from the appropriate field of the instruction. The multiplexer is controlled by RegDst. RegDst is 1 for R-type instructions to choose WriteReg from the rd field, Instr[15:11]; it is 0 for lw to choose the rt field, Instr[20:16]. Finally, beq compares two registers. If they are equal, it takes the branch by adding the branch offset to the program counter. The offset is a positive or negative number, stored in the imm field of the instruction, Instr[15:0]. The offset indicates the number of instructions to branch past. Hence, the immediate must be sign-extended and multiplied by 4 to get the new program counter value: PC' = PC + 4 + SignImm × 4. The next PC value for a taken branch, PCBranch, is computed by shifting SignImm left by 2 bits, then adding it to PCPlus4. The left shift by 2 is an easy way to multiply by 4, because a shift by a constant amount involves just wires. The two registers are compared by computing SrcA - SrcB using the ALU; if the result is zero, as indicated by the Zero flag, the registers are equal. The pipelined datapath is formed by chopping the single-cycle datapath into five stages separated by pipeline registers (Fig.2). For analysis we use a data flow graph (DFG), so the MIPS microarchitecture must first be converted to a DFG. In the proposed DFG we consider only the main blocks of the microarchitecture: Instruction Memory (IM), Register File (RF), ALU (A), and Data Memory (DM). The proposed DFG of the MIPS processor is shown in Fig.3; this DFG includes feedback loops.
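The multiplexer settings and the branch-target calculation described for the single-cycle datapath can be summarized in a small Python sketch. The signal values for R-type, lw, and sw are those stated in the text; the `Branch` signal and the rule that a branch is taken when `Branch` and `Zero` are both 1 are assumptions taken from the usual textbook form of this datapath, and `None` marks a don't-care value. The names `CONTROL` and `next_pc` are illustrative only.

```python
# Control settings per instruction class (None = don't care).
CONTROL = {
    "R-type": dict(RegWrite=1, RegDst=1, ALUSrc=0, MemtoReg=0, MemWrite=0, Branch=0),
    "lw":     dict(RegWrite=1, RegDst=0, ALUSrc=1, MemtoReg=1, MemWrite=0, Branch=0),
    "sw":     dict(RegWrite=0, RegDst=None, ALUSrc=1, MemtoReg=None, MemWrite=1, Branch=0),
    # beq compares two registers, so SrcB comes from the register file (ALUSrc = 0).
    "beq":    dict(RegWrite=0, RegDst=None, ALUSrc=0, MemtoReg=None, MemWrite=0, Branch=1),
}


def next_pc(pc: int, imm16: int, branch: int, zero: bool) -> int:
    """Return PC' = PC + 4, or PC + 4 + SignImm * 4 for a taken branch."""
    pc_plus4 = (pc + 4) & 0xFFFFFFFF
    # Interpret the 16-bit immediate as a signed number (sign extension).
    sign_imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    # Shifting left by 2 multiplies the instruction offset by 4 bytes.
    pc_branch = (pc_plus4 + (sign_imm << 2)) & 0xFFFFFFFF
    return pc_branch if (branch and zero) else pc_plus4
```

With PC = 0x100 and an offset of 3 instructions, a taken branch gives PC' = 0x110, while a branch that is not taken gives PC' = 0x104.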
III. Proposed methods
This part of the paper explains the proposed methods; each method is described in one of the following subsections. The goal of this work is the design and development of a high-performance microprocessor based on the MIPS processor using several techniques. The design combines these techniques, and all parts are interconnected to achieve high performance and high speed.
A. Unfolding Transformation
Unfolding is a transformation technique that changes a program into another program such that one iteration of the new program describes more than one iteration of the original program. It is also used to design bit-parallel and word-parallel architectures from bit-serial and word-serial architectures. In essence, unfolding is equivalent to parallel processing. The technique can be used for sample period reduction and for parallel processing.
a. Algorithm for Unfolding
Parallel processing can be applied at the word level or at the bit level; bit-level processing includes bit-serial, bit-parallel, and digit-serial processing. In the proposed method we use the unfolding technique and design the proposed architecture with 4 levels of parallel processing in order to increase the data processing speed. Fig.4 shows the interconnection between the different system blocks. As seen, the proposed design is based on 4-level parallel processing (unfolded), and DM0 and DM1 are dual-port RAMs (DPRAM). An unfolded DFG is shown in Fig. 5(b). Note that unfolding an edge with w delays in the original DFG produces J - w edges with no delays and w edges with 1 delay in the J-unfolded DFG when w < J. Unfolding preserves the precedence constraints of a DSP program. Applications of unfolding include sample period reduction and parallel processing.
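The edge rule above can be checked with a short Python sketch of the standard unfolding construction, in which an edge U -> V with w delays becomes J edges U_i -> V_((i+w) mod J), each carrying floor((i+w)/J) delays. This is the textbook formulation and is assumed, not confirmed, to be the one used for the proposed design; `unfold_edge` is a hypothetical helper name.

```python
def unfold_edge(u: str, v: str, w: int, J: int) -> list:
    """J-unfold one DFG edge u -> v that carries w delays.

    Returns a list of (source, destination, delays) tuples, where each
    node copy is written as (name, index) with index in 0..J-1.
    """
    return [((u, i), (v, (i + w) % J), (i + w) // J) for i in range(J)]


# Example: an edge with w = 1 delay, unfolded by J = 4.
for src, dst, delays in unfold_edge("U", "V", 1, 4):
    print(src, "->", dst, "delays:", delays)
```

For w = 1 and J = 4 this produces three edges without delays and one edge with a single delay, and the total number of delays over the unfolded edges always equals w.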
c. Parallel processing
Consider the case in which the iteration bound is not an integer. The original DFG then cannot have a sample period equal to the iteration bound. If a critical loop bound is of the form t_l/w_l, where t_l and w_l are mutually coprime, then w_l-unfolding should be used. For example, if t_l = 60 and w_l = 45, then t_l/w_l should be written as 4/3 and 3-unfolding should be used. A simple example of this is shown in Fig 6(a), where the DFG has iteration bound T = 4/3; however, even retiming cannot be used to achieve a critical path of less than 2. This DFG can be unfolded by a factor of 3; the unfolded DFG has iteration bound T = 4, and its critical path is 4. The unfolded DFG performs 3 iterations of the original problem in 4 time units, so the sample period of the unfolded DFG is 4/3, which is the same as the iteration bound of the original DFG. To summarize, the original DFG in Fig 5(a) cannot achieve a sample period equal to the iteration bound because the iteration bound is not an integer, but the unfolded DFG in Fig 5(b) can have a sample period equal to the iteration bound of the original DFG.
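The choice of unfolding factor in this example can be reproduced with Python's `fractions` module, which reduces t_l/w_l to lowest terms so that the denominator is the required factor. The function name `unfolding_factor` is hypothetical and the sketch only restates the arithmetic of the example.

```python
from fractions import Fraction


def unfolding_factor(t_l: int, w_l: int):
    """Reduce the loop bound t_l / w_l and return (bound, unfolding factor)."""
    bound = Fraction(t_l, w_l)        # automatically reduced to lowest terms
    return bound, bound.denominator   # unfold by the reduced denominator


bound, J = unfolding_factor(60, 45)
# After J-unfolding, one iteration of the unfolded DFG covers J original
# iterations, so its iteration bound is J * bound, which is an integer.
unfolded_bound = bound * J
```

For t_l = 60 and w_l = 45 the reduced bound is 4/3, the unfolding factor is 3, and the unfolded iteration bound is 4, so the sample period per original iteration remains 4/3.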
B. C-Slow Retiming
Although pipelining is a huge benefit in FPGA design, and may be required on some FPGA fabrics, it is often difficult for a designer to manage and balance pipeline stages and to insert the necessary delays to meet design requirements. Leiserson et al. were the first to propose retiming, an automatic process to relocate pipeline stages to balance a design. Their algorithm, in O(n² lg n) time, can rebalance a design so that the critical path is optimally pipelined. In addition, two modifications, repipelining and C-slow retiming, can add pipeline stages to a design to further improve the critical path. The key idea is simple: if the number of registers around every cycle in the design does not change, the end-to-end semantics do not change. Thus, retiming attempts to satisfy two primary constraints: all paths longer than the desired critical path are registered, and the number of registers around every cycle is unchanged.
This optimization is useful for conventional FPGAs but absolutely essential for fixed-frequency FPGA architectures, which are devices that contain large numbers of registers and are designed to operate at a fixed, but very high, frequency, often by pipelining the interconnect as well as the computation. To meet the array's fixed frequency, a design must ensure that every path is properly registered. Repipelining and C-slow retiming enable a design to be transformed to meet this constraint. Without automated repipelining and C-slow retiming, the designer must manually ensure that all pipeline constraints are met by the design. The goal of retiming is to move the pipeline registers in a design into the optimal position. Fig.7 shows a trivial example. In the design in Fig.7(a), the nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5, and the input and output registers cannot be moved. Fig.7(b) shows the same graph after retiming. The critical path is reduced from 5 to 4, but the I/O semantics have not changed, as three cycles are still required for a datum to proceed from input to output. As can be seen, the initial design has a critical path of 5 between the internal register and the output. If the internal register could be moved forward, the critical path would be shortened to 4. However, the feedback loop would then be incorrect. Thus, in addition to moving the register forward, another register must be added to the feedback loop, resulting in the final design. Additionally, even if the last node were removed, the design could never have a critical path lower than 4 because of the feedback loop. There is no mechanism that can reduce the critical path of a single-cycle feedback loop by moving registers: only additional registers can speed up such a design.
Retiming's objective is to automate this process: for a graph representing a circuit, with combinational delays as nodes and integer weights on the edges, find a new assignment of edge weights that meets a targeted critical path, or fail if the critical path cannot be met. Leiserson's retiming algorithm is guaranteed to find such an assignment, if it exists, that both minimizes the critical path and ensures that the number of registers around every loop in the design remains the same. It is the second constraint, ensuring that all feedback loops are unchanged, which ensures that retiming does not change the semantics of the circuit. In the retiming constraints, r(u) is the lag computed for each node (which is used to determine the final number of registers on each edge), w(e) is the initial number of registers on an edge, W(u,v) is the minimum number of registers between u and v, and D(u,v) is the critical path between u and v. Leiserson's algorithm takes the graph as input and then adds an additional node representing the external world, with appropriate edges added to account for all I/Os. This additional node is necessary to ensure that the circuit's global I/O semantics are unchanged by retiming. Two matrices, W and D, are then calculated; they represent the number of registers and the critical path between every pair of nodes in the graph. These matrices are necessary because retiming operates by ensuring that at least one register exists on every path that is longer than the critical path in the design. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers that will be placed on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while minor design constraints are imposed on the use of FPGA features.
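For reference, the retiming constraints in the standard Leiserson-Saxe formulation, which the symbols r(u), w(e), W(u,v), and D(u,v) defined above appear to follow, can be written as below. The symbol P for the targeted critical path and the notation w_r(e) for the retimed edge weight are introduced here for illustration and do not appear in the text above.

```latex
\begin{aligned}
r(u) - r(v) &\le w(e)
  && \text{for every edge } e \text{ from } u \text{ to } v, \\
r(u) - r(v) &\le W(u,v) - 1
  && \text{for all } u, v \text{ such that } D(u,v) > P, \\
w_r(e) &= w(e) + r(v) - r(u)
  && \text{(registers on } e \text{ after retiming).}
\end{aligned}
```

The first inequality keeps every retimed edge weight nonnegative, and the second places at least one register on every path longer than the target. Summing w_r(e) around any cycle cancels the lag terms, which is why the number of registers around every loop is preserved.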
The biggest limitation of retiming is that it simply cannot improve a design beyond the design-dependent limit produced by an optimal placement of registers along the critical path. Repipelining and C-slow retiming are transformations designed to add registers in a predictable manner that a designer can account for, which retiming can then move to optimize the design. Repipelining adds registers to the beginning or end of the design, changing the pipeline latency but no other semantics. C-slow retiming creates an interleaved design by replacing every register with a sequence of C registers [12].
a. Proposed Method based on C-slow Retiming
As explained in [12], unlike repipelining, C-slow retiming can enhance designs that contain feedback loops. C-slowing enhances retiming simply by replacing every register with a sequence of C separate registers before retiming occurs; the resulting design operates on C distinct execution tasks. Because all registers are duplicated, the computation proceeds in a round-robin fashion, as illustrated in Fig.6. In this example, which is 2-slow, the design interleaves between two computations. On the first clock cycle, it accepts the first input for the first stream of execution. On the second clock cycle, it accepts the first input for the second stream, and on the third it accepts the second input for the first stream. Because of the interleaved nature of the design, the two streams of execution never interfere. On odd clock cycles, the first stream of execution accepts input; on even clock cycles, the second stream accepts input. The proposed architecture is based on 2-slow and 3-slow retiming. Because the proposed architecture contains feedback loops, it benefits from C-slow retiming. With more registers available, retiming can break the design into finer pieces, which improves the total throughput of the design. It also enables more throughput when exploiting task-level parallelism. Applying the C-slow retiming technique automatically rebalances the registers in the proposed design in order to minimize the worst-case register-to-register path. Fig.9 shows the proposed architecture for 2-slow retiming.
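The round-robin behavior of a C-slowed feedback loop can be illustrated with a small Python simulation. The circuit modeled here, an accumulator y = y + x whose single feedback register is replaced by a chain of C registers, is a hypothetical example chosen for brevity and is not part of the proposed microprocessor.

```python
from collections import deque


def c_slow_accumulator(inputs, C):
    """Simulate an accumulator whose feedback register is replaced by C registers.

    One input is accepted per clock cycle; the function returns the adder
    output observed in each cycle.
    """
    regs = deque([0] * C)      # the C registers in the feedback loop
    outputs = []
    for x in inputs:
        y = regs.pop() + x     # the oldest stored value re-enters the adder
        regs.appendleft(y)     # the new result enters the register chain
        outputs.append(y)
    return outputs


# 2-slow: stream A = 1, 2, 3 and stream B = 10, 20, 30, interleaved cycle by cycle.
out = c_slow_accumulator([1, 10, 2, 20, 3, 30], C=2)
stream_a, stream_b = out[0::2], out[1::2]
```

With C = 2, the outputs on alternating cycles are the running sums of the two input streams taken separately, which shows that the interleaved streams do not interfere.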
C. Double Edge Triggered Registers
One reason for using double-edge-triggered registers is to reduce power consumption. In the proposed datapath, data are transferred on both edges of the clock signal; that is, the operating frequency is reduced to half without reducing efficiency. As explained in [13], synchronization (clocking) plays a special role in a digital system. Acting as the timing signal, the system clock controls the working rhythm of the chip: the clock signal controls all flip-flops so that they sample and store their input data synchronously. In addition, to distribute the clock and control the clock skew, a clock network with clock buffers must be constructed. Recent studies indicate that the clock signals in digital computers consume a large percentage (15%-45%) of the system power [14]. In the storage state, the clock level switches off the input path and the input data are thus rejected, while in the input state, the clock level allows the input signal to travel to the output terminal of the latch. However, if the input data can be received and sampled at both levels of the clock, the flip-flop will receive and process two data values in one clock period. In other words, the clock frequency can be reduced by half while keeping the data rate the same. This means that, under the requirement of preserving the original circuit function and data rate, the dynamic power dissipation due to clock transitions can be reduced by half. A half-frequency clock system is therefore expected to be useful in low-power systems. The latch is the basic unit for composing a flip-flop. The levels of a clock, CLK, are used to drive the latch to either the storage state or the input state. If we use D, Q and Q' to express the input signal, present state and next state of a latch, the state equations for the positive and negative level-sensitive latches can be expressed as:
$$Q' = D \cdot CLK + Q \cdot \overline{CLK} \qquad (2)$$

$$Q' = D \cdot \overline{CLK} + Q \cdot CLK \qquad (3)$$

Equation (2) describes a latch which passes the input data when CLK = 1 and stores it when CLK = 0. Inversely, equation (3) describes a complementary latch, which receives input data at CLK = 0 and stores it at CLK = 1. The corresponding logic structures can be realized with a MUX, as shown in Fig.11(a). If two complementary latches are connected in series, one will be in the storage state while the other is in the input state, and a "non-transparent" edge-triggered flip-flop is formed. The latches in Fig.11(a) can compose the well-known "master-slave flip-flop" shown in Fig.11(b): when CLK = 1, its master latch passes input data and its slave latch is in the storage state; when CLK = 0, its master latch is in the storage state and its slave latch passes and outputs the signal stored by the master latch. Therefore, this flip-flop changes its state at the clock's falling edge and keeps its state unchanged on the clock's rising edge. The master latch does not receive the input data when CLK = 0. Obviously, if the input data has to be received at both clock levels, these two complementary latches should be connected in parallel rather than in series. This gives the "side-by-side flip-flop" shown in Fig.11(c). Since the flip-flop is required to be non-transparent from input to output, the output terminal should always be connected to the latch which is in the storage state. Because the flip-flop's state can change at both the falling and rising edges of the clock, it is named "Double-Edge-Triggered Flip-Flop" and is denoted by the legend shown in Fig.11(c). The proposed double-edge register is shown in Fig.12. We increase system throughput by letting parallel streams of execution operate simultaneously. In addition to processing data on parallel execution paths, we use several other techniques.
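The side-by-side arrangement can be modeled behaviorally in Python as two complementary level-sensitive latches whose output is always taken from the latch that is currently storing. This is a simplified, level-by-level sketch under the assumption that the input is stable during each clock level; it is not the circuit of Fig.12, and the class name `DoubleEdgeFlipFlop` is hypothetical.

```python
class DoubleEdgeFlipFlop:
    """Two complementary latches in parallel with an output multiplexer."""

    def __init__(self):
        self.q_pos = 0  # latch that passes D while CLK = 1, stores while CLK = 0
        self.q_neg = 0  # latch that passes D while CLK = 0, stores while CLK = 1

    def step(self, clk: int, d: int) -> int:
        """Apply one clock level with input d and return the output Q."""
        if clk:
            self.q_pos = d       # positive latch is transparent
            return self.q_neg    # output comes from the storing (negative) latch
        self.q_neg = d           # negative latch is transparent
        return self.q_pos        # output comes from the storing (positive) latch


ff = DoubleEdgeFlipFlop()
# One new data value per clock level, i.e. two values per clock period.
levels = [(1, 5), (0, 6), (1, 7), (0, 8)]
outputs = [ff.step(clk, d) for clk, d in levels]
```

Each value presented during one clock level appears at the output during the next level, so the flip-flop captures data on both the rising and the falling edge and is never transparent from input to output.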
The goal of all these methods is to increase architecture throughput by operating on multiple streams of execution in highly complex designs, including microprocessors. The proposed microarchitecture is shown in Fig.13. All of the methods are applied to the MIPS processor. A 1-to-4 demultiplexer is placed after the instruction memory and is controlled by bits 33:32 of the instruction. The proposed instruction format is as follows:
Bits 33:32 (I[33:32]) select one of the parallelism levels, as seen in Table II. The structure of the instruction is shown in Fig 14(a); for example, the format of an addition instruction (a = b + c) assigned to level 0 is shown in Fig 14(b). The proposed microarchitecture is a 34-bit architecture, so it has a 34-bit datapath. The control unit receives the current instruction from the datapath and tells the datapath how to execute that instruction. Specifically, the control unit produces multiplexer select and memory write signals to control the operation of the datapath. The program counter register contains the address of the instruction to execute in one of the levels. The PC is an ordinary 32-bit register. Its output, PC, points to the current instruction. We begin constructing the datapath by connecting the state elements with combinational logic that can execute the various instructions; in the ALUs we also use high-speed digital circuits for the adders and the multiplier. Control signals determine which specific instruction is carried out by the datapath at any given time. The controller contains combinational logic that generates the appropriate control signals based on the current instruction; the proposed control unit is built from logic gates. The proposed datapath is formed by the following elements: 4 levels of parallelism; C-slow retiming with double-edge registers (shown with dashed lines in the microarchitecture); a 1-to-4 demultiplexer for selecting one of the parallelism levels, a principle of enhancing the datapath's capabilities that is extremely useful; one 4-to-1 multiplexer that chooses among several inputs to select the next instruction or a branch instruction; and, instead of a single-port RAM for the data memory, one dual-port RAM (DPRAM) for every two levels of parallelism, so that two DPRAMs are used in the proposed architecture in total.
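The level selection performed by the 1-to-4 demultiplexer can be sketched in Python as follows. The text only states that bits 33:32 select the parallelism level; the assumption that bits 31:0 carry an ordinary 32-bit MIPS instruction word is made here for illustration, and the names `split_instruction` and `demux_1to4` are hypothetical.

```python
def split_instruction(instr34: int):
    """Split a 34-bit instruction into (level, 32-bit word)."""
    level = (instr34 >> 32) & 0x3      # bits 33:32 select one of 4 levels
    word = instr34 & 0xFFFFFFFF        # bits 31:0 (assumed MIPS encoding)
    return level, word


def demux_1to4(instr34: int) -> list:
    """Route the instruction word to the selected level; other outputs stay idle."""
    level, word = split_instruction(instr34)
    outputs = [None] * 4               # None marks an idle level
    outputs[level] = word
    return outputs


# Example: the MIPS encoding of add $s0, $s1, $s2 (0x02328020) sent to level 2.
routed = demux_1to4((2 << 32) | 0x02328020)
```

In this sketch only the output with index 2 receives the instruction word, which mirrors how the demultiplexer directs each fetched instruction to exactly one of the four parallel datapaths.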
IV. Comparison
This paper presents a novel high-performance and low-power microprocessor based on the MIPS processor on FPGA for high-speed applications. For verification, the MIPS processor is chosen as the benchmark against which the applied methods are tested. The proposed method has better performance than a traditional MIPS processor. The proposed design has been written in the VHDL hardware description language. In order to get actual numbers for the hardware usage and maximum operating frequency, this work was synthesized and implemented using the Quartus II 9.1 software. The synthesis results table shows logic utilization (number of registers, combinational ALUTs, total block memory bits) and maximum frequency for the traditional MIPS (MIPS_only), the pipelined traditional MIPS with single-edge registers (MIPS_pipe), the proposed architecture with single-edge registers (method_1), and the proposed architecture with double-edge registers (method_2). Table III shows the power consumption of method_1 and method_2. Power is analyzed using the Xilinx XPower analyzer. XPower is the power analysis software available for programmable logic design. It enables interactive and automatic analysis of power consumption for Xilinx FPGA and CPLD devices. XPower includes both interactive and batch applications. The total device power and the power per net can be analyzed earlier in the design flow than before, for routed, partially routed, or unrouted designs. The results of applying the proposed method to the MIPS processor show that a low-power and fast architecture has been achieved. As seen in the two tables above, the maximum frequency is increased in method_1 (the proposed architecture with single-edge registers), and the power consumption of the proposed method with double-edge registers (method_2) is decreased.
V. CONCLUSION
The aim of this paper is to develop and implement a novel, modified, FPGA-based high-performance and low-power microprocessor. The methods applied to increase performance and reduce power consumption of the designed architecture are unfolding transformation (parallel processing), C-slow retiming, and double-edge-triggered registers. The main goal of all these methods is to increase architecture throughput by reducing execution time in highly complex designs, including microprocessors. The achieved results verify the proposed microarchitecture on a high-speed FPGA.
