The paper presents three embedded soft-core processor architectures, developed by the authors to be easily implementable while yielding low digital resource usage. These architectures are compared with each other using a dedicated testing method based on control algorithm implementations. For reference, the same tests and comparisons have also been carried out on a well-established architecture, the Xilinx PicoBlaze processor. Measurement results and application suggestions are given in the concluding section.
Introduction
In this paper we present three custom-built microprocessor architectures that are optimized for low reconfigurable resource usage so that they can be easily implemented on low-cost FPGAs. These are minimalistic 8- and 16-bit processors intended for utility functions and for educational use in FPGA-based System-on-Chip (SoC) designs. The design goals were a good balance between the number of logic cells and on-chip memories used, reasonable performance, and a high maximum clock frequency. The latter is important when the core is used as a utility processor in a SoC, so that it does not become the frequency bottleneck of the design. Among the presented architectures, there are designs built on a pipelined accumulator architecture with directly addressable registers stored in on-chip memory.
The large majority of today's Systems-on-Chip (SoC) [1], [2], [3] are built on a synchronous architecture and operate at a single clock frequency. In these cases, more than 40% of the power is consumed by the clock network. Clock skew issues increase the complexity of clock routing to the various parts of the processor. Therefore, the most important challenges in such designs are clock distribution, timing closure, clock frequency up-scaling, power consumption optimization and reusability [4]. As the quest for a smaller digital footprint and higher speed continues [5], the complexity of clock management and power dissipation increases, affecting design performance. It is certainly possible to create a soft-core 8-bit or 16-bit microprocessor in an FPGA, but the achievable performance depends on a range of factors, such as the instruction set, the addressing modes and the type of architecture selected (pipeline stages, dependency analysis, etc.) [6]. We present three possible approaches, each with advantages and drawbacks relative to the state of the art, represented in this case by the Xilinx PicoBlaze architecture.
The Sapientia Lab Processor (SLP)
This processor core has been developed primarily as an educational tool for engineering students. Nonetheless, it can be useful in many applications due to its low resource usage, ease of programming and the set of peripheral modules that can be attached. The aim was to design a 16-bit RISC microcontroller in VHDL and implement it on an FPGA, taking as a reference the Xilinx PicoBlaze processor, presented briefly in a later section. The core has been designed using modular blocks that follow the von Neumann architecture. It is built on a RISC architecture with a two-field, 16-bit instruction format, consisting of a 5-bit instruction code and an 11-bit operand address. It has a 256-level-deep stack stored in memory and an Arithmetic Logic Unit (ALU) for operations on 16-bit data and for bit-level logic instructions. The memory module uses one BlockRAM of a Xilinx Spartan 3E FPGA, divided into three sections. The program memory can hold 1Kx16-bit instructions, while the data memory can store 768x16-bit variables. The block diagram of the processor architecture is presented in Fig. 1. The main modules of the SLP (visible in Fig. 1) are the register block, the ALU and the control unit (CU). The register block comprises the following registers: general purpose (GP), Address, Data, Instruction, Program Counter (PC), Stack Pointer (SP) and the Flag storage (for the Zero, Carry, Borrow and Interrupt Enable flag bits). The ALU can process basic operations with 16-bit operands, with the exception of multiplication, for which only the lower bytes of the variables are used. The CU is a hybrid design, with hardwired logic for the instruction fetch and decoding phase and a microprogrammed execution phase. This yields instruction execution times of 5 to 11 clock cycles at a maximum frequency of 35 MHz, but keeps the design easy to use as an educational tool. Branching can be implemented via four types of conditional jumps and a function call mechanism.
Optionally, an interrupt system can be enabled for the control of the different input/output modules that can serve a wide range of applications.
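The two-field instruction format described above can be illustrated with a minimal Python sketch. The field order (instruction code in the upper five bits, operand address in the lower eleven) and the helper names `encode` and `decode` are assumptions made for illustration; the text only fixes the field widths.

```python
# Minimal sketch of a 5-bit opcode / 11-bit address instruction word.
# ASSUMPTION: the opcode occupies the upper bits; the paper does not
# state the field order, only the widths.
OPCODE_BITS = 5
ADDRESS_BITS = 11
WORD_BITS = OPCODE_BITS + ADDRESS_BITS  # 16-bit instruction word

ADDRESS_MASK = (1 << ADDRESS_BITS) - 1
OPCODE_MASK = (1 << OPCODE_BITS) - 1


def encode(opcode, address):
    """Pack an opcode and an operand address into one 16-bit word."""
    if not 0 <= opcode <= OPCODE_MASK:
        raise ValueError("opcode does not fit in 5 bits")
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address does not fit in 11 bits")
    return (opcode << ADDRESS_BITS) | address


def decode(word):
    """Split a 16-bit instruction word into (opcode, address)."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("instruction word does not fit in 16 bits")
    return word >> ADDRESS_BITS, word & ADDRESS_MASK
```

An 11-bit operand field can address 2048 words. This would be consistent with the 1024 instruction words, 768 data words and a 256-entry stack mentioned above, if the stack is indeed the third memory section.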
Embedded FPGA based multicore microcontroller designed for control algorithms (EMMC)
The second microcontroller presented in this paper is a modified Harvard architecture RISC machine with two four-stage pipelines for instruction execution. Instruction dependencies are checked at hardware level. Ideally, after filling the pipelines, the processor can run eight instructions in parallel. To provide high accuracy, the arithmetical unit of the processor uses 32-bit floating-point values. All instructions are executed in 6 clock cycles, and the maximum clock frequency is 25 MHz. The arithmetical unit has four modules for each arithmetical operation (ADD, SUB, MUL, DIV). The processor architecture is presented in Fig. 2. Since it was built for system control applications, the microcontroller contains A/D and D/A peripheral modules to communicate with the outside world. The instructions are downloaded to the program memory through the UART module. An additional encoder module is built in to read incremental encoders. During instruction execution, the control unit reads two consecutive instructions from the program memory and checks for data dependencies. When no dependencies are found, the instructions are sent to the pipelines for execution. If data dependencies are found, the instruction is saved in a FIFO and executed later. In the case of data dependencies, not all pipeline stages are used. The microcontroller has a simple instruction set with only ten instructions: four arithmetical instructions (ADD, SUB, MUL, DIV), four instructions for data movement (READ, OUTPUT, READEN, MOV), one JUMP instruction, and a variable instruction to move variables to the data memory.
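The dependency check performed on each pair of fetched instructions can be sketched as follows. This is only a behavioural illustration under stated assumptions: the `Instr` record, the register names, the three hazard types tested, and the policy of issuing a deferred instruction first in the next slot are all hypothetical and not taken from the EMMC hardware description.

```python
from collections import deque, namedtuple

# Hypothetical instruction record: operation, destination, two sources.
Instr = namedtuple("Instr", "op dest src1 src2")


def depends(first, second):
    """True if `second` cannot be issued together with `first`.

    ASSUMPTION: read-after-write, write-after-write and
    write-after-read conflicts all count as data dependencies.
    """
    first_reads = {first.src1, first.src2} - {None}
    second_reads = {second.src1, second.src2} - {None}
    raw = first.dest is not None and first.dest in second_reads
    waw = first.dest is not None and first.dest == second.dest
    war = second.dest is not None and second.dest in first_reads
    return raw or waw or war


def schedule(program):
    """Group a program into issue slots of one or two instructions."""
    fifo = deque()   # holds an instruction deferred by a dependency
    slots = []
    pc = 0
    while pc < len(program) or fifo:
        # A deferred instruction is issued before fetching new ones.
        if fifo:
            first = fifo.popleft()
        else:
            first = program[pc]
            pc += 1
        if pc < len(program):
            second = program[pc]
            pc += 1
            if depends(first, second):
                fifo.append(second)          # execute later
                slots.append((first,))       # one pipeline stays idle
            else:
                slots.append((first, second))
        else:
            slots.append((first,))
    return slots


if __name__ == "__main__":
    program = [
        Instr("ADD", "r1", "r2", "r3"),
        Instr("MUL", "r4", "r1", "r5"),  # reads r1 written by ADD
        Instr("SUB", "r6", "r7", "r8"),
        Instr("DIV", "r9", "r6", "r2"),  # reads r6 written by SUB
    ]
    for slot in schedule(program):
        print([i.op for i in slot])
```

In this model, a block without data dependencies fills both pipelines in every slot, while a dependent pair leaves one pipeline idle for that slot.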
The Single Cycle Computer
The Single Cycle Computer (SCC) 16-bit processor is based on architectural elements presented in [7]. The architecture has been adapted and extended to include the elements illustrated in Fig. 3. The ALU of the SCC can perform typical operations with one or two operands (ADD, SUB, ROTATE, SHIFT) and also computes the flag bits usable in branching statements. A separate module of the SCC is responsible for data and stack memory management. The writeback component saves the results of the ALU operations into one of the eight locations of the general purpose register bank and also updates the special function registers. The processor also features a timer/counter module and a simple interrupt system. The SCC has been prototyped on remote FPGA hardware using viciLogic [8]. The SCC IDE enables assembly program capture, compilation and transfer to the SCC instruction memory. The Python-based IDE supports extended debugging, including breakpoints, single stepping, and animated illustration of every signal within the FPGA to uncover the detailed behavior of the processor. A goal of viciLogic is to look inside processor hardware in real time, to demonstrate how programs execute.
The architecture considered as reference, the Xilinx PicoBlaze
The Xilinx user guide describes the PicoBlaze as follows: "The PicoBlaze™ microcontroller is a compact, capable, and cost-effective fully embedded 8-bit RISC microcontroller core optimized for the Xilinx FPGA families. The KCPSM3 version described in this user guide occupies just 96 FPGA slices in a Spartan®-3 Generation FPGA which is only 12.5% of an XC3S50 device and a miniscule 0.3% of an XC3S5000 device. In typical implementations, a single FPGA block RAM stores up to 1024 program instructions, which are automatically loaded during FPGA configuration. Even with such resource efficiency, the PicoBlaze microcontroller performs a respectable 44 to 100 million instructions per second (MIPS) depending on the target FPGA family and speed grade." [9]. In order to obtain relevant results, we have used the Spartan®-3 version (Fig. 4, with a maximum clock of 125.3 MHz) of this highly optimized core, written using only structural VHDL code, in the testing phase of our work, presented in the following section.
Processor architecture evaluation methods
In this section, we investigate the testing of the basic building blocks of a processor core. One can design special modules that implement fundamental functions and serve as a test kernel from which complex testing schemes can be devised. The testing of the ALUs is performed in separate phases. The first phase is the computation of test values (using a pseudo-random number generation algorithm [10]), while the second phase is the test application followed by the test-response evaluation.
In order to test ALU modules working on 2n-bit integer values, we need two 2n-bit vectors, the input carry bit and the operation code. These test vectors can be generated by accumulating a 4n-bit constant that is then divided into two 2n-bit values (S1 and S2). Each pair of 2n-bit test vectors, Z and W, is generated using two additions executed by the ALU. With i as the loop index, Z_i is first generated by adding S1 to Z_{i-1} with InputCarry=0. Next, W_i is generated in a similar way. The operation code for the ALU is obtained as the exclusive sum (XOR) of Z and W shifted left once. The generated test vectors, Z and W, are then applied to the ALU. The test programs are usually very short, and no other test data needs to be stored. They are executed at full system speed and are easily parametrized, which enables their use in a wide range of applications. Besides implementing the above method on the processors described in the previous sections, we have also implemented a custom testing method by developing code for a proportional (P) and a PID system controller. A sample of the implemented PID algorithm is given in Fig. 5. To ensure comparable results, the input signal measurement and the command signal generation have been considered to be achievable with only one I/O instruction execution. The SLP, the SCC and the PicoBlaze all gave similar results, as presented in Table 1 (the Fmax of the Spartan3E board used, the Digilent Nexys3, is 100 MHz, while the Fmax values in the table are those reported after VHDL synthesis by the Xilinx development tools), while the EMMC goes slightly above them in certain code snippets. This happens when the EMMC finds an instruction block with no data dependencies, so that the two pipelines can be kept filled and no stall events occur.
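The accumulator-based generation of the Z and W test vectors described above can be sketched in Python as follows. Several details are assumptions, since the text leaves them open: S1 is taken as the low half and S2 as the high half of the constant, W is updated by adding S2 together with the carry out of the Z addition (so that the pair behaves like one 4n-bit accumulator), the operation code is formed as Z XOR (W shifted left once), and the 4-bit opcode width is hypothetical.

```python
def generate_test_vectors(constant, n, count, opcode_bits=4):
    """Return `count` tuples (Z, W, opcode) for a 2n-bit ALU.

    ASSUMPTIONS (not fixed by the text): S1 is the low half of the
    4n-bit constant, the carry out of the Z addition is fed into the
    W addition, and the opcode is Z XOR (W << 1) truncated to
    `opcode_bits` bits.
    """
    width = 2 * n
    mask = (1 << width) - 1
    s1 = constant & mask             # low 2n bits of the constant
    s2 = (constant >> width) & mask  # high 2n bits of the constant

    z = 0
    w = 0
    vectors = []
    for _ in range(count):
        # First ALU addition: Z_i = Z_{i-1} + S1 with InputCarry = 0.
        total = z + s1
        z = total & mask
        carry = total >> width
        # Second ALU addition: W_i = W_{i-1} + S2 + carry.
        w = (w + s2 + carry) & mask
        opcode = (z ^ (w << 1)) & ((1 << opcode_bits) - 1)
        vectors.append((z, w, opcode))
    return vectors


if __name__ == "__main__":
    # 8-bit ALU (n = 4) driven by an arbitrary 16-bit constant.
    for z, w, opcode in generate_test_vectors(0x35A7, n=4, count=4):
        print(f"Z={z:#04x} W={w:#04x} opcode={opcode:#x}")
```

Only the constant and the loop counter need to be stored, which matches the observation that the test programs are very short and need no additional test data.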
As the other three architectures do not have pipelined structures, the execution time in these cases is a measure of the efficiency of the design (the number of clock cycles per instruction (CPI) and the diversity of the instruction set) and of the maximum achievable speed of the resulting circuit inside the XC3S250E Xilinx Spartan FPGA.
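For a non-pipelined core, the relationship between CPI, clock frequency and execution time can be made explicit with a short sketch. The instruction count of 100 is a hypothetical example, and the PicoBlaze figure of two clock cycles per instruction is taken from its user guide rather than from the measurements in this paper. The estimates are back-of-the-envelope values that ignore differences in instruction sets, and they are not a substitute for the measured results in Table 1.

```python
def execution_time_us(instruction_count, cpi, fmax_mhz):
    """Execution time in microseconds for a non-pipelined core.

    time = instructions * cycles_per_instruction / clock_frequency
    With the frequency in MHz, the result is directly in microseconds.
    """
    if fmax_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return instruction_count * cpi / fmax_mhz


if __name__ == "__main__":
    n_instr = 100  # hypothetical program length

    # SLP: 5 to 11 cycles per instruction at 35 MHz (from the text).
    print("SLP best case :", execution_time_us(n_instr, 5, 35.0), "us")
    print("SLP worst case:", execution_time_us(n_instr, 11, 35.0), "us")

    # PicoBlaze: 125.3 MHz (from the text), 2 cycles per instruction
    # (ASSUMPTION taken from the PicoBlaze user guide).
    print("PicoBlaze     :", execution_time_us(n_instr, 2, 125.3), "us")
```

The same program generally needs a different number of instructions on each core, so a fair comparison has to account for the instruction count as well as for CPI and Fmax.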
Conclusions
During our study we have implemented two custom processor cores (SLP and EMMC) using VHDL, and tested another two (SCC and Xilinx PicoBlaze), all for FPGA implementation. The integer operation test of the ALU components showed that the most important factor in calculation-intensive applications is the CPI value. On the other hand, during the P controller test we observed that the EMMC pipelines were stalled, as the code had many instruction blocks with data dependencies. All in all, we would argue that all the processor architectures presented and tested in this paper can be used successfully in FPGA-implemented applications, where distributing tasks to several instances of these cores promises performance enhancements. The SCC and the PicoBlaze are the most efficient when it comes to simple algorithms that have many data-interdependent code blocks, while the EMMC's main advantage is its parallelization capability. The SLP is the slowest in terms of CPI, but it can be put to good use in highly recursive algorithms due to its deep stack, while also having the largest set of peripherals available.
