Abstract
Introduction
OISC (One Instruction Set Computer) is the ultimate RISC (Reduced Instruction Set Computer): a conventional CPU whose instruction set is reduced to a single instruction. Having only one processor instruction eliminates the need for an opcode and permits simpler computational elements, so that more of them can be implemented in hardware with the same number of logic gates. Since our goal was to build a functional multi-processor system with the maximum possible number of processors on a single low-cost programmable chip, OISC was a natural choice; the remaining step was the selection of a suitable single-processor instruction set.
Currently known OISCs can be roughly separated into three broad categories.
Transport Triggered Architecture (TTA) is a design in which computation is a side effect of data transport. Usually some memory registers (triggering ports) within a common address space perform an assigned operation when an instruction references them. For example, in an OISC utilizing a single memory-to-memory copy instruction [1], triggering ports perform arithmetic and instruction pointer jumps when they are written to. Despite the appealing simplicity, there are two major drawbacks associated with such an OISC. The first is that the CPU has to have separate functional units controlling the triggering ports. The second is the difficulty of generalizing the design, since any two different hardware designs are likely to use two different assembly languages. Because of these disadvantages we ruled out this class of OISCs for our implementation.
Bit Manipulating Machines form the simplest class. A bit copying machine, called BitBitJump, copies one bit in memory and passes the execution unconditionally to the address specified by one of the operands of the instruction [2]. This process turns out to be capable of universal computation (i.e. being able to execute any algorithm and to interpret any other universal machine) because copying bits can conditionally modify the code that is about to be executed. Another machine, called the Toga computer, inverts a bit and passes the execution conditionally depending on the result of the inversion [3]. Yet another bit operating machine, similar to BitBitJump, copies several bits at the same time. The problem of computational universality is solved in this case by keeping predefined jump tables in memory [4]. Despite the simplicity of the bit operating machines, we ruled them out too, because they require more memory than is normally available on an inexpensive FPGA. To make a functional multiprocessor machine with a bit manipulation operation, at least 1Mb of memory per processor is required. Therefore we decided that a more complex processor with less memory is a better choice for our purposes.
Arithmetic based Turing-complete Machines use an arithmetic operation and a conditional jump. Unlike the two previous classes, which are universal computers, this class is universal and Turing-complete in its abstract representation. The instruction operates on integers which may also be addresses in memory. Currently there are several known OISCs of this class, based on different arithmetic operations [5]: addition (Addleq), decrement (DJN), increment (P1eq), and subtraction (Subleq, Subtract and Branch on result Less than or Equal to zero). The latter is the oldest, the most popular and, arguably, the most efficient [6] [7]. A Subleq instruction consists of three operands: two for subtraction and one for the conditional jump.
Attempts to build hardware around Subleq have been undertaken previously. For example, David A Roberts designed a Subleq CPU and wrote a software Subleq library [8]. His implementation is a single CPU with keyboard input, terminal, control ROM, and 16Mb RAM, and is much more complex than ours. A few other similar designs have been described on various Internet sites, e.g. [9]. However, all of them were just proof-of-concept simulations without a practical implementation.
In the following sections we describe the components of the system we built. Section 2 outlines the Subleq abstract machine and its assembly notation. Section 3 describes our hardware implementation of the multiprocessor core. Section 4 briefly describes the techniques used to convert a high level programming language into Subleq instruction code. In Sections 5 and 6 we present comparative speed test results for our device, followed by a discussion and summary. In the Appendix the code calculating factorials is presented in C and Subleq notations. Figure 1 represents the connection of our device to a computer with a USB cable.
Subleq Assembly Language
The Subleq abstract machine operates on an infinite array of memory, where each cell holds an integer number. This number can be an address of another memory cell. The numbering starts from zero. A program is defined as a sequence of instructions read from the memory, with the first instruction at address zero. A Subleq instruction has 3 operands:
A B C
Execution of one instruction A B C subtracts the value in the memory cell at the address stored in A from the content of the memory cell at the address stored in B and then writes the result back into the cell with the address in B. If the value after subtraction in B is less than or equal to zero, the execution jumps to the address specified in C; otherwise execution continues to the next instruction, i.e. the address of the memory cell following C.
Assembly notation helps to read and write code in Subleq. The following is the list of syntax conventions:
• label;
• question mark;
• reduced instruction;
• multi-instruction;
• literal and expression;
• data section;
A label is a symbolic name of a particular address, followed by a colon. In the following example each line represents one instruction with three operands. Here A, B, and C are not abstract names, but labels (addresses) of specific cells in the memory. For example, label A refers to the 4th cell, which is initialised with the value 2. The first instruction subtracts the value of the cell A from the value of the cell B, which is 1, and stores the result in the cell B, which becomes -1. Since the result is less than zero, the next instruction to be executed is the third line, because the value of C is the address of the first operand of the instruction on the third line. That instruction subtracts B from B, making it zero, so the execution is passed to address 0. If these three lines are the whole program, then the first operand of the first instruction has address 0. In this case the execution is passed back to the first instruction, which makes B equal to -2. That process continues forever. The only instructions being executed are the first and the third lines, and the value of the cell B changes as 1, -1, 0, -2, 0, -2, 0, and so on.
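The behaviour described above can be checked with a small emulation. The sketch below is written in Python and assumes a memory layout reconstructed from the prose, since the original listing is not reproduced here: the first instruction occupies cells 0 to 2, the cells labelled A and B hold 2 and 1, one padding cell follows, and label C marks the instruction B B 0. The names run_trace and memory are illustrative only.

```python
def run_trace(memory, watch, steps):
    """Execute `steps` Subleq instructions and record memory[watch] after each one."""
    mem = list(memory)              # work on a copy of the program image
    ip = 0                          # execution starts at address zero
    history = [mem[watch]]
    for _ in range(steps):
        a, b, c = mem[ip], mem[ip + 1], mem[ip + 2]
        mem[b] -= mem[a]            # subtract and store the result in cell b
        ip = c if mem[b] <= 0 else ip + 3
        history.append(mem[watch])
    return history


# Assumed layout (a reconstruction, not the original listing):
# cells 0-2 hold "A B C", cell 3 is A (value 2), cell 4 is B (value 1),
# cell 5 is padding, and cells 6-8 hold "B B 0", so label C is address 6.
A, B, C = 3, 4, 6
memory = [A, B, C,
          2, 1, 0,
          B, B, 0]

trace = run_trace(memory, watch=B, steps=6)
print(trace)
```

Under these assumptions the recorded values of B follow the sequence given in the text.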
Question mark is defined as the address of the next cell in memory:
is the same as
Reduced instruction format is a convenient shortcut: two operands instead of three assume the third to be the address of the next instruction, i.e. ?; and only one operand assumes the second to be the same as the first, so a single operand A stands for A A ?.
If more than one instruction is placed on the same line, each instruction except the last must be followed by a semicolon. Comments are delimited by the hash symbol #: everything from # to the end of the line is a comment. A jump to a negative address halts the program. Usually (-1) as the third operand is used to stop the program, for example:
The parentheses around (-1) are necessary to indicate that it is the third operand, so that the -1 is not interpreted as part of an expression together with the preceding operand.
To make a Subleq program interactive (requesting data and responding to the user while working), input and output operations can be defined as operations on a non-existing memory cell. The same (-1) address can be used for this. If the second operand is (-1), the value of the first operand is the output. If the first operand is (-1), the second operand gets its value from the input stream. Input and output operations are defined on a byte basis in ASCII code. If the program tries to output a value greater than 255, the behaviour is undefined.
Below is a "Hello world" program adapted from Lawrence Woodman.
The program above consists of five instructions. The first instruction prints the character pointed to by its first operand (the first pointer), which is initialised to the beginning of the data string, the letter 'h'. The second instruction increments that pointer, i.e. the first operand of the first instruction. The third instruction increments the second pointer, which is the second operand of the fourth instruction. The fourth instruction tests the value pointed to by the second pointer and halts the program when the value is zero. It becomes zero when the pointer reaches the cell one after the end of the data string, which is Z:0. The fifth instruction loops back to the beginning of the program, so the process continues until the halt condition is satisfied.
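The five-instruction structure described above can be tried out in a small emulator. The following Python sketch is a reconstruction from the prose only, not the original listing: the memory image, the string "hello, world\n" (the text states only that it begins with 'h'), and the names run, H, Z and M1 are assumptions made for illustration. The emulator also assumes that an I/O instruction continues with the next instruction and that reading past the end of input yields 0.

```python
def run(memory, input_bytes=b"", max_steps=100_000):
    """Emulate Subleq with byte I/O mapped to the non-existing address -1."""
    mem = list(memory)
    ip = 0
    pending = list(input_bytes)
    out = []
    for _ in range(max_steps):
        if ip < 0:                      # a jump to a negative address halts
            break
        a, b, c = mem[ip], mem[ip + 1], mem[ip + 2]
        if b == -1:                     # second operand -1: output mem[a]
            out.append(mem[a])
            ip += 3
        elif a == -1:                   # first operand -1: read a byte into mem[b]
            mem[b] = pending.pop(0) if pending else 0   # assumed end-of-input value
            ip += 3
        else:
            mem[b] -= mem[a]
            ip = c if mem[b] <= 0 else ip + 3
    return bytes(out).decode("ascii")


text = "hello, world\n"                 # assumed data string
H = 15                                  # address of the first character
Z = H + len(text)                       # zero cell right after the string
M1 = Z + 1                              # cell holding the constant -1
program = [
    H,  -1, 3,     # 0:  print the character the first pointer (cell 0) refers to
    M1, 0,  6,     # 3:  increment the first pointer by subtracting -1
    M1, 10, 9,     # 6:  increment the second pointer (cell 10)
    Z,  H,  -1,    # 9:  test the pointed-to value, halt when it is zero
    Z,  Z,  0,     # 12: unconditional jump back to the start
] + [ord(ch) for ch in text] + [0, -1]

output = run(program)
print(output, end="")
```

Both pointers are operand cells inside the code itself (cells 0 and 10), which is the self-modification the description relies on.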
Hardware design

Overview
We have used an Altera Cyclone III EP3C16 FPGA as the basis for the hardware implementation. The choice was based on the relatively low price (~ US$30) of this FPGA chip and the availability of test hardware for it.
The test board we used has a DDR2 RAM IC fitted, but access to the RAM is limited to one process at a time. A truly parallel implementation requires separate internal memory blocks allocated for each processor, hence the amount of available memory in the FPGA limits the number of processors. The EP3C16 FPGA has 56 16-bit memory blocks of 8 Kbits each. Our implementation of one 32-bit Subleq processor requires a minimum of 2 memory blocks, so only 28 processors can fit into the FPGA. We could have chosen a 16-bit implementation and had more processors (up to 56), but with only 1 Kbyte of memory allocated to each.
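The trade-off between word size and processor count follows from simple arithmetic on the figures quoted above. The Python sketch below reproduces it; the function name layout is hypothetical, and 8 Kbits is assumed to mean 8192 bits.

```python
BLOCKS = 56                 # memory blocks in the EP3C16, as stated in the text
BLOCK_BITS = 8 * 1024       # assumed: 8 Kbits = 8192 bits per block


def layout(word_bits, blocks_per_cpu):
    """Return (processors, bytes per processor, words per processor)."""
    cpus = BLOCKS // blocks_per_cpu
    mem_bytes = blocks_per_cpu * BLOCK_BITS // 8
    words = mem_bytes * 8 // word_bits
    return cpus, mem_bytes, words


print(layout(32, 2))        # the 32-bit design described in the text
print(layout(16, 1))        # the 16-bit alternative
```

Both variants give each processor the same number of words; the 16-bit option doubles the processor count at the cost of halving the word width.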
The FPGA is connected to the USB bus with the help of an external Cypress FX2 CPU that is configured as a bridge between USB and SPI (Serial Peripheral Interface), which is utilised to load the FPGA with code and data. The interface bridge is transparent to the PC software. To increase the number of processors, additional boards with FPGAs could easily be connected via the USB bus.
Interface description
Each processor has a serial interface to the allocated memory and the Status byte, accessible from a single address serial loading. The serial interface takes over memory's data and address buses when processing is stopped.
Address space inside the FPGA is organised as a table addressed by two numbers: processor index and memory address. Reading one byte from index 0 returns the number of processors inside the FPGA. For this design the returned value is 28. Indices from 1 to 28 are assigned to the processors, each of which has 2048 bytes (512 32-bit words) of memory available.
Writing to a processor memory is an operation which sequentially loads a buffer of 2048 bytes. Reading from the processor's memory is different: the first word (4 bytes) returned is the status of the processor and the rest is the memory content.
The Status byte, the first byte of the first word, can be in one of three states: 0xA1 (running), 0xA2 (stopped), or 0xA0 (stopped and not run since power on). Writing into the processor's memory automatically starts execution, thus eliminating the need for a separate command. Reading from a processor's memory stops that processor. An exception is reading the first byte of status, which does not stop the processor. Additionally, a processor can be stopped by the Subleq halt operand (-1), as mentioned in Section 2. Other negative references, such as input or output as described above in the Subleq assembly language section, also stop the processor, because no IO operations are defined in this architecture.
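As an illustration of the read-back format, a host program could split the returned buffer into the status and the memory words roughly as follows. This Python sketch is not the authors' host software: the function name parse_readback is hypothetical, and the little-endian signed 32-bit word encoding is an assumption, since the text does not state the byte order.

```python
import struct

STATUS_NAMES = {
    0xA1: "running",
    0xA2: "stopped",
    0xA0: "stopped, not run since power on",
}


def parse_readback(buffer):
    """Split a read-back buffer into (status name, list of memory words)."""
    status = STATUS_NAMES.get(buffer[0], "unknown")   # first byte of the first word
    # Assumed encoding: little-endian signed 32-bit words after the status word.
    words = [w for (w,) in struct.iter_unpack("<i", buffer[4:])]
    return status, words


# Example buffer: a status word followed by three memory words.
sample = bytes([0xA2, 0, 0, 0]) + struct.pack("<3i", 3, -1, 7)
print(parse_readback(sample))
```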
Subleq processor
The state machine algorithm can be presented in pseudocode as: where IP is an instruction pointer, memory[] is a value of a memory cell, and A, B, and C are integers.
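A minimal software model of such a state machine, written in Python, might look as follows. It is a sketch under stated assumptions rather than the authors' VHDL logic: a negative A or B operand is assumed to stop the processor before any memory access, a negative C stops it only when the jump is taken, and 32-bit overflow is not modelled. The names step and run are illustrative.

```python
def step(memory, ip):
    """Execute one instruction; return the new IP, or None when the processor stops."""
    a, b, c = memory[ip], memory[ip + 1], memory[ip + 2]
    if a < 0 or b < 0:              # negative reference: no IO is defined, so stop
        return None
    memory[b] -= memory[a]
    if memory[b] <= 0:
        return None if c < 0 else c # a jump to a negative address halts
    return ip + 3


def run(memory, max_steps=10_000):
    """Run from address zero until the processor stops or the step budget is used."""
    ip = 0
    for _ in range(max_steps):
        ip = step(memory, ip)
        if ip is None:
            break
    return memory


# Example: add cell A to cell B through the zero cell Z, then halt.
A, B, Z = 9, 10, 11
memory = [A, Z, 3,      # Z = -A
          Z, B, 6,      # B = B + A
          Z, Z, -1,     # clear Z and halt
          5, 7, 0]
run(memory)
print(memory[B])
```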
The Subleq processor core is written with the help of the RAM 2 Port Megafunction of the Quartus II software, which we used to build dual port memory access. The implemented solution allows access to content at two distinct addresses (memory[A] and memory[B]) simultaneously, which saves processing clock ticks. The disadvantage of this implementation is an additional latency of one clock tick to access the data and address buses compared to a single port memory implementation. However, the total number of processing clock ticks per memory access for the dual port is still less than that required for a single port.
The core is based on a state machine, which starts automatically when memory of a processor is loaded. On any read or write operation, or encountering a negative operand the processing stops, but the first status byte can be read at any time without affecting the computation.
C Compiler for Subleq
In this section we briefly describe some elements of the compiler we wrote, which compiles simplified C code into Subleq [11]. The compiler is used in one of our tests, so that a direct comparison is possible between the execution of compiled native C code and of Subleq code compiled from the same C source. The compiler is a high-level language interface to OISC, the only such compiler known to us at the time of writing.
Stack
The primary C programming language concepts are functions and the stack. Implementation of the stack in Subleq is achievable by using memory below the code. Using code self-modification one can place values onto the stack and retrieve them from it. Function calls require the return address to be placed onto the stack. Consider the following C code:
... }
After the above is compiled to machine code, it must perform the following operations: 1) the address of the instruction immediately after the call to f has to be put on the stack; 2) a jump to the code of the function f must be made; 3) at the end of the function f, the address has to be extracted from the stack; and 4) the execution should be transferred to the extracted address. According to the C standard, the function main is a proper C function, i.e. it can be called from other functions including itself. Hence the program must have a separate entry point, which in the following code is called sqmain. The C code above compiles into:
The stack pointer cell sp is the last memory cell in the program. It is initialised with the negative value of its own address. A negative value is used here to speed up the code execution: working with the subtraction operation may sometimes save a few steps if the data is recorded as the negative of its actual value. The instruction dec sp subtracts 1 from sp, hence increasing its real value by 1. Below is an excerpt calling the function f in a more readable form, in which relative references ? are replaced with labels.
The instruction on the fourth line is to clear the cell in the stack, since some values can be left there from previous use. However, clearing the top cell in the stack is not a single-step task, because one has to clear the operands of the instruction itself and then initialise them with the value of the sp pointer. Thus the executed command sequence is the following: allocate a new cell in the stack by increasing the stack pointer (first line); clear the first operand of the instruction; initialise this operand with the new value of the stack pointer (second line); do the same with the second operand of the instruction, i.e. clear and initialise it (third line); and then execute the instruction, which will clear the allocated cell in the stack (fourth line).
The next two instructions clear and initialise cell C similarly. The instruction D C:0 _f copies the address of the instruction inc sp to the stack and jumps to _f. This works, because D holds the value of the next memory cell (remember ?) and C points to the now cleared top cell on the stack. A negative value written to the stack then forces a jump to the label _f.
Once inside the function f, stack pointer can be modified, but we assume that functions restore it before they exit. So the return code has to jump to the address extracted from the stack:
A; sp A
B; A:0 B
Z Z B:0
Here the value of the stack pointer sp is written to A, and the instruction A:0 B copies the stored address to B. The address is stored negated, so the positive value is restored.
The stack does a little bit more than just storing the return address. That will be explored later in subsections 4.3 and 4.4.
Expressions
The mutually recursive definitions of an expression grammar can be used by a program called a parser to build a tree representation for any grammatically valid expression. Once such a tree is built, the compiler's job is to organise the sequence of instructions so that the result of any sub-tree is passed up the tree. For example, a tree of the expression a+(b-c) consists of a node '+', a variable a and a sub-tree, which consists of a node '-' and the variables b and c. To make the calculation, the compiler must use a temporary variable to store the result of the sub-tree, which has to be used later in the addition, and potentially further up if this expression is a part of a larger expression. In this particular example we need only one temporary, but generally many temporaries are required. The expression is compiled into the following code:
t; b Z; Z t; Z
c t
a Z; Z t; Z
The first line copies the value of b into the temporary t. The second line subtracts the value of c from the temporary. At this point the compiler is finished with the sub-tree. Its result is the generated code and the temporary variable t holding the value of the calculated sub-tree. Now the compiler generates code for the addition. Its arguments are the variable a and the temporary t. The third line adds a to t. Now t holds the result of the whole expression. If this expression is a part of a larger expression, t is passed up the tree as an argument to the upper level node of the tree. If not, then the value of t is discarded because the evaluation is finished.
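The effect of the three lines can be verified by executing them symbolically. The Python sketch below treats each reduced instruction as a pair of named cells and ignores jumps, which is valid here because every reduced instruction continues with the next one regardless of the result. The function name run_straight_line and the sample values are illustrative assumptions.

```python
def run_straight_line(program, cells):
    """Apply each two-operand instruction 'a b' as cells[b] -= cells[a]."""
    for a, b in program:
        cells[b] -= cells[a]
    return cells


# "t" alone expands to "t t"; "Z" alone expands to "Z Z".
program = [
    ("t", "t"), ("b", "Z"), ("Z", "t"), ("Z", "Z"),   # t; b Z; Z t; Z   (t = b)
    ("c", "t"),                                       # c t              (t = b - c)
    ("a", "Z"), ("Z", "t"), ("Z", "Z"),               # a Z; Z t; Z      (t = a + (b - c))
]

cells = run_straight_line(program, {"a": 5, "b": 7, "c": 3, "t": 99, "Z": 0})
print(cells["t"])
```

The auxiliary cell Z is returned to zero after each copy or addition, so it can be reused by the next group of instructions.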
More advanced grammar may involve assignment, dereferencing, unary operations and many others. But each grammar construction can be represented by a corresponding sub-tree, and processed later by the compiler to produce code. For example, a subtraction from a dereferenced value represented in C as:
has to be translated into
t; k Z; Z t; Z
a t:0
Here a temporary variable has to be used inside the code for dereferencing. The sequence of instructions is: clear t; copy value k into t; subtract a from the memory k points to.
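Because the temporary t is an operand cell inside the code, this sequence is self-modifying, which can be seen in a small emulation. The Python sketch below uses an assumed memory layout and sample values (k points to cell 21 holding 10, a holds 4); the trailing halt instruction and all names are additions for illustration and are not part of the compiler output.

```python
def run(memory, max_steps=1000):
    """Plain Subleq emulation; a jump to a negative address halts."""
    mem = list(memory)
    ip = 0
    for _ in range(max_steps):
        if ip < 0:
            break
        a, b, c = mem[ip], mem[ip + 1], mem[ip + 2]
        mem[b] -= mem[a]
        ip = c if mem[b] <= 0 else ip + 3
    return mem


K, Z, A, TARGET = 18, 19, 20, 21    # assumed data addresses
T = 13                              # t is the second operand of the instruction at 12
program = [
    T, T, 3,            # 0:  t        clear the operand cell
    K, Z, 6,            # 3:  k Z      Z = -k
    Z, T, 9,            # 6:  Z t      t = k
    Z, Z, 12,           # 9:  Z        clear Z
    A, 0, 15,           # 12: a t:0    subtract a from the cell k points to
    Z, Z, -1,           # 15: halt (added for the demonstration)
    TARGET, 0, 4, 10,   # 18: k, Z, a, and the cell k points to
]

result = run(program)
print(result[TARGET])
```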
Only a few elements of the grammar processing have been touched on here. The C grammar takes several pages just to list the BNF. However, larger and more complex grammars are resolved by the compiler in a similar manner.
Function calls
In the subsection 4.1 above it was shown how to push into and pop from the stack. When a function takes arguments, they have to be pushed into the stack together with the return address. The stack must be restored upon function return. Consider a function taking two arguments:
The call to a function f must be translated into something like this:
In C, arguments can be expressions and the call to a function can be a part of another expression (a sub-expression), i.e. the compiler must properly handle more complicated cases like the following:
Here for simplicity the C function type int(*)(int,int) is represented as int. Subleq supports only one variable type. Therefore, a more elaborate typing system does not introduce extra functionality in the language.
Arguments pushed onto the stack can properly be calculated as sub-expressions (sub-trees). In this sense, for the actual function call it is irrelevant whether program variables or temporaries are pushed onto the stack. The code above handles neither return values nor indirect calls yet. A return value can be stored in a special variable (register). If the program uses the return value in a sub-expression, then it must copy the value into a temporary immediately upon return. Indirect calls can be achieved by dereferencing a temporary holding the address of the function. This is straightforward, but requires more complex code.
The stack pointer can be modified inside a function when the function requests stack (local) variables. For accessing local variables, a base pointer bp is usually used. It is initialised on function entrance; it is used as a base reference for local variables, each of which has an associated offset from the base pointer; and it is used to restore the stack pointer at the end of the function. Functions can call other functions, which means that each function must save the base pointer upon entry and restore it upon exit. So the function body has to be wrapped with the following commands:
Since all temporaries used in the function are pushed onto the stack, it pays off to reduce the number of temporaries used. It is possible to do this just by releasing any used temporary into a pool of used temporaries. Then later, when a new temporary is requested, the pool is checked first and a new temporary is allocated only when the pool is empty.
The expression 1+k[1] compiles into
t1; t2; _k t1; dec t1; t1 t2 t3; t4; ?+11; t2 Z; Z ?+4; Z; 0 t3; t3 t4; t5; t6; dec t5; t4 t5; t5 t6 # result in t6
When pool of temporaries is introduced the number of temporaries is halved:
t1; t2; _k t1; dec t1; t1 t2 t1; t3; ?+11; t2 Z; Z ?+4; Z; 0 t1; t1 t3 t1; t2; dec t1; t3 t1; t1 t2 # result in t2
which dramatically reduces the code, removing the corresponding push and pop operations.
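The pooling strategy itself is simple to express. The following Python sketch shows one possible shape of such a pool; the class name TempPool and the naming scheme t1, t2, ... are illustrative assumptions and are not taken from the compiler's source.

```python
class TempPool:
    """Hand out temporaries, reusing released ones before allocating new names."""

    def __init__(self):
        self.free = []      # released temporaries available for reuse
        self.count = 0      # number of distinct temporaries ever allocated

    def acquire(self):
        if self.free:       # check the pool first
            return self.free.pop()
        self.count += 1     # allocate a new temporary only when the pool is empty
        return f"t{self.count}"

    def release(self, name):
        self.free.append(name)


pool = TempPool()
x = pool.acquire()          # new temporary
y = pool.acquire()          # new temporary
pool.release(x)             # x is no longer needed
z = pool.acquire()          # reuses the released temporary
print(x, y, z, pool.count)
```

The value of count after code generation is the number of temporaries that must be pushed onto the stack by the function.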
Stack variables
Once bp is placed on the stack and sp is decremented to allocate memory, all local variables become available. They can be accessed only indirectly because the compiler does not know their addresses. For example, the function f in
has 4 local variables with the stack size equal to 6. When this function is entered the stack has the following values:
The compiler knows about the offset of each variable from bp.
Variable  Offset
y         -3
x         -2
a         1
b         2
c         3
d         6
Hence, in the code any reference to a local variable, not pointing to an array, can be replaced with *(bp+offset). The array c has to be replaced with (bp+offset) because the name of array is the address of its first element. The name does not refer to a variable, but the referencing with [] does. In C
which can be interpreted in our example as
Multiplication
The only trivial multiplication in Subleq is multiplication by 2, t=a+a:
t; a Z; a Z; Z t; Z
To multiply 2 numbers one can use the formula
This is a simple recursive formula, but it requires integer and modular division. Division can be implemented with the following algorithm. Given two numbers A and B, B is doubled until the next doubling would give B greater than A. At the same time as doubling B, we double another variable I, which has been initialized to 1. When B cannot be doubled further without exceeding A, I holds a part of the result of the division; the rest is to be calculated further using A-B and the original B. This can be done recursively, accumulating all the I's. At the last step, when A<B, A is the modulus. This algorithm can be implemented as a short recursive function in C. Upon exit this function returns the integer division as the result and the division modulus in the argument j.
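The doubling-based division can be sketched as follows, here in Python rather than C, returning the quotient and modulus as a pair instead of using an output argument. The multiplication shown is one recursive formula of the kind described, a*b = 2*(a*(b div 2)) + a*(b mod 2); it is an assumption made for illustration and may differ from the formula the authors used. Both functions assume non-negative operands and a positive divisor, and use only addition, subtraction and comparison.

```python
def divmod_doubling(a, b):
    """Return (a // b, a % b) using only doubling, subtraction and comparison."""
    if a < b:
        return 0, a                 # last step: a is the modulus
    doubled, i = b, 1
    while doubled + doubled <= a:   # stop before the next doubling exceeds a
        doubled += doubled
        i += i
    quotient, remainder = divmod_doubling(a - doubled, b)
    return i + quotient, remainder  # accumulate all partial results i


def multiply(a, b):
    """Return a * b via halving b and doubling the partial product (assumed formula)."""
    if b == 0:
        return 0
    half, odd = divmod_doubling(b, 2)
    partial = multiply(a, half)
    return partial + partial + (a if odd else 0)


print(divmod_doubling(17, 5))
print(multiply(6, 7))
```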
Conditional jump
In C, Boolean expressions which evaluate to zero are false and non-zero are true. In Subleq this leads to longer code when handling Boolean expressions because every Boolean expression evaluates on basis of equality or non-equality to zero.
A better approach is to treat less than or equal to zero as false and a positive value as true. Then the if-expression if(expr){<body>} will be just one instruction
where t is the result of the expression expr. However, to remain fully compatible with C (for example, if(x+1){...}, an implicit conversion to Boolean), all cases where an integer expression is used as Boolean have to be detected. Fortunately there are only a few such cases:
The job can be done inside the parser, so the compiler would not have to care about Boolean or integer expression, and it can produce much simpler code.
In cases when a Boolean variable is used in expressions as integer, like in:
• passing an argument f(a>0)
• returning from a function return(a>0);
• assignment x=(a>0);
• other arithmetic expression x=1+5*(a>0);
the variable must be converted to C-style, i.e. negative result zeroed. This can be done as simply as
A terse check for a value being less than, equal to, or greater than zero is:
where L, E, and G are addresses to pass the execution to in cases when x is less than, equal to, or greater than zero respectively. Figure 3 shows the schema of the execution. Note that x does not change and Z is zero on any exit!
Results
To test the efficiency of the board we chose two mathematical problems. The first calculates the size of a function residue of an arithmetic group. The second calculates modular double factorials.
Test #1
In the first test we selected the problem of finding the order of a function residue of the following process:
where x and y are integers initialised to 1, mod is a modulo operation, and M is some value. Starting from the point (x0=1, y0=1), the equations generate a sequence of pairs. We chose this problem because its solution is difficult, with answers often much greater than M (but less than M^2). The number M was selected such that the calculations could be completed in a few minutes. When this sequence is sufficiently long, a new pair of generated numbers will eventually be the same as a pair previously generated in the sequence. The task is to find how many steps have to be completed before the first occurrence of a result with the same value. In our test the selected value of M was M=5039 and the number of iterations was calculated as 12693241.
From these results we conclude that the speed of a single processor on the FPGA is of the same order of magnitude as the speed of the CPU of an ordinary PC when emulating Subleq instructions. Native code on the PC runs about a hundred times faster.
Test #2
The second test was a calculation of modular double factorials, namely
The 28 FPGA processors easily outperform the emulation of the same Subleq code on a PC. C code without multiplication, compiled into Subleq and emulated, runs faster than C code with multiplication, because the compiler's library multiplication function is not as efficient as the multiplication function written in this example.
Conclusion
Using an inexpensive Cyclone III FPGA we have successfully built an OISC multiprocessor device with processors running in parallel. Each processor has its own memory limited to 2 Kbytes. Due to this limitation we were unable to build a multiprocessor board with an even simpler individual processor instruction set, such as bit copying [2], because in that case the minimum memory required to run practically useful computational tasks is ~ 1 Mb per processor. The limited memory available in our device also did not permit us to run more advanced programs, such as emulators of another processor, or to use more complex computational algorithms, because all the computational code has to fit inside the memory allocated for each processor.
The size of memory available to each processor can be increased by choosing a larger and faster, albeit more expensive, FPGA such as Stratix V. Then a faster processing clock and a larger number of CPUs could be implemented as well. The VHDL code of the CPU state machine could also be optimised, improving computational speed. Given sufficient memory, it would be possible to emulate any other processor architecture, and to use algorithms written for other CPUs or run an operating system. Apart from the memory constraint, another downside of this minimalist approach was reduced speed. Our board uses a rather slow CPU clock speed of 150 MHz. As mentioned above, a more expensive FPGA can run at much faster clock speeds.
On the other hand, the simplicity of our design allows for it to be implemented as a standalone miniature-scale multi-processor computer, thus reducing both physical size and energy consumption. With proper hardware, it might also be possible to power such devices with low power solar batteries similar to those used in cheap calculators. Our implementation is scalable -it is easy to increase the number of processors by connecting additional boards without significant load on host's power supply. A host PC does not have to be fast to load the code and read back the results. Since our implementation is FPGA based, it is possible to create other types of runtime reloadable CPUs, customised for specific tasks by reprogramming FPGA.
In conclusion, we have demonstrated the feasibility of the OISC concept and applied it to building a functional prototype of an OISC multi-processor system. Our results demonstrate that, with proper hardware and software implementation, substantial computational power can be achieved even in a very simple OISC multi-processor design.
