ABSTRACT: Soft-core processors are increasingly used in embedded applications because of their flexibility, cost effectiveness and platform independence. They allow designers to modify the core design with ease to meet application-specific goals. In this paper, we present the design of an enhanced soft-core processor based on OpenCores that is suited to telecommunication, multimedia and a variety of other embedded applications. We use the OR1200 platform, a 32-bit RISC core with DSP capabilities, a Harvard microarchitecture and a 5-stage integer execution pipeline. We enhance the processor design with a Global Memory Stall Controller, which manages the Data Path Unit of the processor and distributes stall signals whenever the memory latency cannot be hidden. We also suggest improvements to the data path of the processor to better support multimedia applications. Finally, we propose adding a Hazard Controller to the execution pipeline to handle data and branch hazards.
INTRODUCTION
The demand for soft-core processors has increased over the last few years. As the complexity of embedded systems has grown, designing every hardware component of a system from scratch has become too impractical and expensive for most designers. Soft-core processors can be synthesized for any ASIC or FPGA technology and therefore give designers great flexibility. Their advantages include technology independence and a higher level of architectural abstraction, which makes the design easier to understand. Evolving multimedia and communication designs need adaptable hardware. As industry requirements change, processor designs become obsolete in a shorter span of time. The challenge for hardware architects is to create fast, adaptable processors for applications that require high performance and efficiency.
Reconfigurable technology helps us build flexibility and adaptability into soft-core processors [1]. Improved Computer-Aided Design and Electronic Design Automation (EDA) tools help to construct rapid prototypes of SoCs easily using Hardware Description Languages [2]. Improvements in the power consumption of soft-core processors, combined with recent advances in their applications, motivate us to incorporate changes in the design of the OR1200 soft-core processor to enhance its efficiency [3].
In this paper, we present an enhanced OR1200 core that would be well suited to multimedia and telecom applications. The rest of the paper is organised as follows. Section 2 discusses our motivation and the existing OR1200 architecture. Section 3 describes the proposed design modifications in detail. Figure 1 shows the complete architecture of the OR1200 soft-core processor. This processor can be synthesized and downloaded onto Altera and Xilinx FPGAs, and it supports embedded real-time operating systems such as Linux, μClinux and OAR RTEMS.
OR1200 IMPLEMENTATIONS
Various alternatives and extensions to the original OR1200 have been suggested and implemented. They include Plasma [5], a synthesizable 32-bit RISC microprocessor with a 3-stage pipeline. Plasma runs a live web server and comes with an interrupt controller, UART, SRAM and an Ethernet controller. Others include the aeMB [6], OpenFire [7] and MB-lite [8] from the OpenCores community, which are based on Xilinx's MicroBlaze architecture.
The Leon3 [9], a synthesizable model of a 32-bit processor compliant with the SPARC V8 architecture, is one of the most advanced open-source processors for embedded systems. It has an advanced 7-stage pipeline and multiple power modes to improve power consumption, and it is extensively reconfigurable.
Second Student Research Symposium (SRS), International Conference on Advances in Computing, Communications and Informatics (ICACCI'13), 22 -25 August 2013, Mysore, India
PROPOSED MODIFICATIONS
In this section we discuss four possible modifications to the existing OR1200 architecture to improve its performance and throughput.
EMBEDDING HPRC IN EXECUTION UNIT
This phase proposes a new technique for embedding a multigrain parallel processing HPRC, implemented on an FPGA, in the CPU/DSP unit of the OR1200 soft-core RISC processor. Core performance is increased by placing the multigrain parallel processing HPRC inside the Integer Execution Pipeline unit of the CPU/DSP core. The HPRC unit inside the OR1200 soft core achieves its performance through dynamic hardware multitasking [10]. Inter-module communication is carried out through the ICAP (Internal Configuration Access Port), which is a reasonable medium for cross-chip communication [11].
The multigrain parallelism in the HPRC is accomplished by two functions: (i) HPRC_Parallel_Start, which triggers the parallelism, and (ii) HPRC_Parallel_End, which stops the parallelism.
The parallel processing HPRC embedded in the Integer Execution Pipeline unit of the OR1200 is shown in Figure 2 below.
GLOBAL MEMORY STALL CONTROLLER
Memory plays a vital role in the performance of the processor. To improve the efficiency of the OR1200 memory interface, we introduce a global memory stall controller similar to the one used in the SecretBlaze processor [12]. The stall controller manages the memory subsystem and issues stall signals when memory latency cannot be hidden. The algorithm below shows the two functions of the stall controller. Function 1 halts the core if the data cache or the instruction cache is busy. Function 2 halts the data cache when the processor is in sleep mode, when an IO operation is incomplete, or when an instruction is not available. A stall therefore occurs when the processor performs a memory read and the data is not available from the cache. We use busy signals to indicate the type of memory operation that cannot be completed within a single clock cycle.
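The two functions described above can be illustrated with a minimal behavioural sketch in Python. It is not the controller's HDL: the signal names (icache_busy, dcache_busy, sleep_mode, io_pending, instr_available) and the cycle-by-cycle trace are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MemoryStatus:
    """Status flags sampled in one clock cycle (all names are hypothetical)."""
    icache_busy: bool = False      # instruction cache cannot answer this cycle
    dcache_busy: bool = False      # data cache cannot answer this cycle
    sleep_mode: bool = False       # processor is in sleep mode
    io_pending: bool = False       # an IO operation has not completed yet
    instr_available: bool = True   # a valid instruction is available


def halt_core(s: MemoryStatus) -> bool:
    """Function 1: halt the core if the data or instruction cache is busy."""
    return s.icache_busy or s.dcache_busy


def halt_dcache(s: MemoryStatus) -> bool:
    """Function 2: halt the data cache in sleep mode, while an IO operation
    is incomplete, or when no instruction is available."""
    return s.sleep_mode or s.io_pending or not s.instr_available


def stalled_cycles(trace) -> int:
    """Count the cycles of a trace in which the core is halted."""
    return sum(1 for s in trace if halt_core(s))


# Illustrative trace: a data cache miss that keeps the cache busy for two cycles.
trace = [
    MemoryStatus(),
    MemoryStatus(dcache_busy=True),
    MemoryStatus(dcache_busy=True),
    MemoryStatus(),
]
print(stalled_cycles(trace))
```

In this model the busy flags play the role of the busy signals mentioned above: each one marks a memory operation that cannot complete within the current clock cycle.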
DMA PATH RECONFIGURATION
The existing data path of the OR1200 is not well suited to multimedia processing. To improve the performance of multimedia applications on the OR1200, we make some design changes in the data path unit of the processor [13]. The QMEM memory is divided into two segments, one for data memory and one for instruction memory. A DMA channel is added to transfer blocks of data between external memory and the data memory. The CPU is thus free to do other tasks while the DMA channel works concurrently with the data cache. This should increase throughput, especially for applications that are likely to have a high cache miss rate. The two-core design shown in the figure is an adaptation of the Hyperpipelined OR1200 implementation with the modifications suggested in the draft.
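The benefit of overlapping DMA transfers with computation can be estimated with a simple cycle-count model. The sketch below assumes double buffering (the DMA channel fetches the next block while the CPU processes the current one) and fixed per-block costs. The function names and the cycle counts in the example are hypothetical and are not measurements of the OR1200.

```python
def cycles_without_dma(compute_cycles: int, transfer_cycles: int, blocks: int) -> int:
    """The CPU copies each block itself and then processes it, so costs add up."""
    return blocks * (transfer_cycles + compute_cycles)


def cycles_with_dma(compute_cycles: int, transfer_cycles: int, blocks: int) -> int:
    """The DMA channel fetches block i+1 while the CPU processes block i.

    Only the first transfer and the last computation are fully exposed;
    every step in between costs the slower of the two activities.
    """
    if blocks == 0:
        return 0
    overlapped = (blocks - 1) * max(compute_cycles, transfer_cycles)
    return transfer_cycles + overlapped + compute_cycles


# Illustrative numbers only: 4 blocks, 100 compute cycles and 60 transfer cycles each.
print(cycles_without_dma(100, 60, 4))
print(cycles_with_dma(100, 60, 4))
```

Under these assumptions the gain grows with the number of blocks and is largest when transfer and compute times are similar, which is consistent with the expectation that applications with frequent misses benefit most.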
HAZARD CONTROLLER
To improve the 5-stage execution pipeline of the OR1200, we introduce a hazard controller that handles data and branch hazards. The Hazard Controller provides flush and stall signals to each stage of the pipeline. It is activated when a data hazard or a branch miss occurs, in order to preserve the efficiency of the processor. This design was introduced earlier in SecretBlaze [12]; our goal is to implement it in the OR1200 and compare its performance with the existing implementation.
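How stall and flush decisions might be derived can be sketched with a small behavioural model. It assumes one common policy, which is not taken from the OR1200 or SecretBlaze sources: the pipeline is stalled on a load-use data hazard and flushed on a branch miss. The Instr type and the hazard_signals function are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Minimal view of an instruction in a pipeline stage (hypothetical)."""
    op: str                       # e.g. "load", "alu", "branch"
    dest: Optional[int] = None    # destination register, if any
    srcs: Tuple[int, ...] = ()    # source registers


def hazard_signals(decode: Optional[Instr],
                   execute: Optional[Instr],
                   branch_miss: bool) -> Tuple[bool, bool]:
    """Return (stall, flush) for one clock cycle.

    flush: a branch miss was detected, so younger instructions are squashed.
    stall: the instruction in decode reads a register that a load still
           in execute has not produced yet (load-use data hazard).
    """
    if branch_miss:
        # The flush takes priority: the instruction in decode is discarded anyway.
        return (False, True)
    load_use = (
        decode is not None
        and execute is not None
        and execute.op == "load"
        and execute.dest is not None
        and execute.dest in decode.srcs
    )
    return (load_use, False)


# A load into r1 followed immediately by an add that reads r1 needs a stall.
load = Instr("load", dest=1, srcs=(4,))
add = Instr("alu", dest=3, srcs=(1, 2))
print(hazard_signals(add, load, branch_miss=False))
```

A full controller would produce such signals for every pipeline stage; the sketch covers only the decode and execute stages to keep the decision logic visible.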
CONCLUSION AND FUTURE WORK
In this paper, we proposed modifications to the existing OR1200 architecture that would enhance its performance. The modified design offers good flexibility and adaptability while balancing computing performance, and it could be used as an embedded processor in a range of applications. This work started as a design exploration, and we are currently implementing the above modules in the OR1200 soft-core processor. Stage 1, which includes testing the reconfigured multigrain parallel HPRC, was implemented successfully using the matrix multiplication CoreMark benchmarks. Maheswari R. and Pattabiraman V. discuss the results of the reconfigured multigrain HPRC in their work [14].
The next step is to test and verify all the modules of the modifications proposed above. Once this is complete, we intend to work on further improvements to the instruction set architecture and to identify and improve the parts of the design that use area inefficiently.
