The design of a parallel computing system using several thousand or even up to a million processors calls for processing units that are simple and thus small in area, so that as many processing units as possible fit on a single die.
Overview
The Null Operand Parallel processor is designed to build a parallel computing system out of large numbers of such instances. Its main purpose is to allow for a prototypical implementation of a dynamic software system as a proof of concept.
While such processors have been designed, existing approaches either allow for substantial simplification ([2009dm]), or lack the flexible means of communication or the resources ([2011ga]) needed to implement a dynamic software system, i.e. an operating system with user interaction and dynamic process creation.
The Null Operand Parallel processor is composed of a number of processing units, each with its own local fast memory, and a communication switch that allows messages to be exchanged between the processing units, and between processors that are connected via an external link (see figure 1). Each processing unit provides multiple register sets to implement a number of independent hardware threads. Those threads that are active are scheduled in a simple round robin manner. Threads may block (waiting for data availability) and stop (either controlled or upon fault).
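The round robin scheduling of active threads can be sketched as follows. This is a minimal illustration only: the state names, the `HwThread` class and the `next_active` helper are hypothetical and do not come from the processor design.

```python
from dataclasses import dataclass

# Hypothetical thread states; the names are illustrative only.
ACTIVE, BLOCKED, STOPPED = "active", "blocked", "stopped"


@dataclass
class HwThread:
    ident: int
    state: str = ACTIVE


def next_active(threads, last):
    """Return the index of the next active thread after `last`, or None.

    The register sets are scanned cyclically; blocked and stopped
    threads are skipped.
    """
    n = len(threads)
    for step in range(1, n + 1):
        idx = (last + step) % n
        if threads[idx].state == ACTIVE:
            return idx
    return None


threads = [HwThread(0), HwThread(1, BLOCKED), HwThread(2), HwThread(3, STOPPED)]
order = []
current = -1
for _ in range(4):
    current = next_active(threads, current)
    order.append(current)
print(order)
```

With threads #1 blocked and #3 stopped, only threads #0 and #2 take turns.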
All addressing for a single thread is done relative to base pointers, one for code and constant data, and one for variable data. This way all operations are base register relative, and so any thread is fully relocatable even at runtime.
A single thread executes its instructions sequentially, one at a time. Instructions are encoded in eight bits each, with no operands encoded. Instead, operations are performed on a single thread-local stack.
All addressing is done wordwise, i.e. there are no byte addressable items at all, and consequently, there are no alignment issues. The only register to address subunits of words is the instruction pointer, as there are always four instructions in a word.
For each thread, the processing unit implements a number of data channel ports that are directly connected to the communication switch. The communication switch provides external links, so it is possible to connect processors and thus build a large computation network. Transmission of data to a remote port on a different processor may require using the switches of one or more intermediate processors whenever there is no direct connection between the originating and the target processor. To handle arbitrary (and especially transitional) channels, the communication switch implements a generic, table driven routing algorithm.
For peripheral data transmission, the communication switch provides a set of bidirectional interfaces to transfer data to and from peripheral units without any specific transmission protocol.
Numbers for the current test implementation:
Another two registers are derived from the limits:
cp (constants pointer) = lc_0 + 64
dp (data pointer) = ld_0 + 64
Whenever the instruction pointer is out of range of the code limits, or the stack pointer is out of range of the data limits, the thread is stopped as faulty (see figure 2). At start time of a thread, its stack pointer is initialised to ld_1. When a thread is stopped, be it faulty or not, an exception message is sent to a destination port, which is stored in the exc register at thread start time.
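The derived registers and the range check can be sketched as below. The text does not state whether the limits are inclusive, nor in which unit the instruction pointer is compared against the code limits, so the bounds used here are assumptions; the function names are hypothetical.

```python
POOL_OFFSET = 64  # offset of cp and dp above the lower limits, as stated above


def derive_pointers(lc0, ld0):
    """Return (cp, dp) derived from the lower code and data limits."""
    return lc0 + POOL_OFFSET, ld0 + POOL_OFFSET


def is_faulty(ip, sp, lc0, lc1, ld0, ld1):
    """Return True if the thread would be stopped as faulty.

    Assumptions (not confirmed by the text): ip is compared as a word
    address with lc0 <= ip < lc1, and sp may equal ld1 because it is
    initialised to ld1 and predecremented before the first store.
    """
    code_ok = lc0 <= ip < lc1
    stack_ok = ld0 <= sp <= ld1
    return not (code_ok and stack_ok)


cp, dp = derive_pointers(0x100, 0x200)
print(hex(cp), hex(dp))
```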
Communication Switch
To send a message through a channel, the destination channel port number first has to be assigned to the local channel port. Any subsequent message sent through this channel is sent to that destination, until a new destination is assigned. A global channel port number is 32 bits wide:
A channel transmission path in use is blocked from its local to its destination end for as long as the message is being sent. To end a message, the sender must send a terminating END token. When sending a message is paused and is to be continued later, the sender is free to send a PAUSE token in between, which (in contrast to the END token) is not delivered to the receiving destination. Sending a PAUSE token frees the channel transmission path until sending of the message is continued.
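The token rules can be illustrated with a toy model of a single transmission path. The `ChannelPath` class, its attributes and the example port number are hypothetical; the sketch ignores routing and blocking, and only records when the path is occupied and what the receiver gets to see.

```python
from collections import deque

END = object()    # terminating token, delivered to the receiver
PAUSE = object()  # pause token, never delivered to the receiver


class ChannelPath:
    """Toy model of one path between a local and a destination port."""

    def __init__(self):
        self.destination = None
        self.busy = False          # True while a message occupies the path
        self.delivered = deque()   # items visible at the destination port

    def set_destination(self, port_number):
        self.destination = port_number

    def send(self, item):
        if self.destination is None:
            raise RuntimeError("no destination assigned")
        if item is PAUSE:
            self.busy = False              # path freed, nothing delivered
        elif item is END:
            self.delivered.append(END)     # receiver sees the message end
            self.busy = False
        else:
            self.busy = True               # path blocked while data flows
            self.delivered.append(item)


path = ChannelPath()
path.set_destination(0x00010203)  # arbitrary example port number
path.send(1)
busy_during_message = path.busy
path.send(PAUSE)
busy_after_pause = path.busy
path.send(2)
path.send(END)
print(list(path.delivered) == [1, 2, END], path.busy)
```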
Instruction Opcodes
All instructions are encoded as eight bit opcodes, and do not encode operands explicitly but work on stack data.
Boolean values are words: 0 is false, all other values are true. Instructions that generate a boolean value will always push -1 for true.
In the following detailed description of the single instructions, m denotes the local memory. Other single character variables denote temporary values.
--sp is the stack pointer predecremented before use, sp++ is the stack pointer postincremented after use.
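The stack notation can be made concrete with a small model. It is a sketch only: the `Stack` class is hypothetical, memory is a plain list, and the 32 bit word width is taken from the SWAP description. ADD and ZERO serve as examples of an arithmetic and a boolean instruction.

```python
MASK = 0xFFFFFFFF  # 32 bit words


class Stack:
    """Downward growing stack in a word addressed memory m."""

    def __init__(self, size):
        self.m = [0] * size
        self.sp = size              # empty stack: sp at the upper limit

    def push(self, value):
        self.sp -= 1                # --sp: predecrement before use
        self.m[self.sp] = value & MASK

    def pop(self):
        value = self.m[self.sp]
        self.sp += 1                # sp++: postincrement after use
        return value


def add(stack):
    b = stack.pop()
    a = stack.pop()
    stack.push(a + b)


def zero(stack):
    # Generated booleans are -1 (all bits set) for true, 0 for false.
    stack.push(-1 if stack.pop() == 0 else 0)


s = Stack(8)
s.push(2)
s.push(3)
add(s)
total = s.pop()
s.push(0)
zero(s)
flag = s.pop()
print(total, hex(flag))
```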
The port function calculates a global port number from the processor's id, the processing unit's number, the given thread number, and the local channel port number:
Immediate Constant
Opcodes: 0x00..0x7F, 0xC0..0xFF
All opcodes except those in the range from 0x80 to 0xBF load their own constant value, sign extended, onto the stack:
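A minimal sketch of this decoding, assuming the eight bit opcode is interpreted as a two's complement value; the function name is hypothetical. Under this reading the immediate constants cover the 192 values from -64 to 127.

```python
def immediate_value(opcode):
    """Return the constant an opcode pushes, or None for an instruction."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in eight bits")
    if 0x80 <= opcode <= 0xBF:
        return None                  # regular instruction, not a constant
    # Sign extend the eight bit value: 0xC0..0xFF become -64..-1.
    return opcode - 0x100 if opcode & 0x80 else opcode


print(immediate_value(0x7F), immediate_value(0xC0), immediate_value(0xFF))
```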
No Operation
Name: NOP Opcode: 0x80
No operation:
Add
Name: ADD Opcode: 0x81
Pop two words, push the sum:
Subtract
Name: SUB Opcode: 0x82
Pop a subtrahend, pop a minuend, subtract the subtrahend from the minuend, push difference:
Multiply
Name: MUL Opcode: 0x83
Pop two words, push the product:
Unsigned Divide
Name: UDIV Opcode: 0x84
Pop a divisor, pop a dividend, unsigned divide dividend by divisor, push quotient and remainder. On division by zero stop thread:
Signed Divide
Name: SDIV Opcode: 0x85
Pop a divisor, pop a dividend, signed divide dividend by divisor according to the Euclidean division (see e.g. [2001dl]), push quotient and remainder. On division by zero stop thread:
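Euclidean division always yields a remainder r with 0 ≤ r < |divisor|. The sketch below derives it from Python's floored `divmod`; it works on unbounded integers and does not model 32 bit wrap-around. The function name is hypothetical, and the exception merely stands in for stopping the thread.

```python
def euclid_divmod(dividend, divisor):
    """Return (quotient, remainder) with 0 <= remainder < abs(divisor)."""
    if divisor == 0:
        # Stands in for "stop thread" in this sketch.
        raise ZeroDivisionError("division by zero stops the thread")
    q, r = divmod(dividend, divisor)   # floored: r has the divisor's sign
    if r < 0:                          # only possible for a negative divisor
        q += 1
        r -= divisor
    return q, r


for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(a, b, euclid_divmod(a, b))
```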
Bitwise And
Name: AND Opcode: 0x86
Pop two words, push the conjunction:
Bitwise Or
Name: OR Opcode: 0x87
Pop two words, push the disjunction:
Bitwise Exclusive Or
Name: XOR Opcode: 0x88
Pop two words, push the exclusive disjunction:
Pop
Name: POP Opcode: 0x89
Pop one word and discard it:
Duplicate
Name: DUP Opcode: 0x8A
Pop one word and push it twice:
Exchange
Name: EXCH Opcode: 0x8B
Pop one word, pop another, then push the one, push the other:
Load From Stack
Name: LDX Opcode: 0x8C
Pop an index, add it to the stack pointer, push word it points to:
Swap Bit Fields
Name: SWAP Opcode: 0x8D
Pop a 5 bit mask, pop a word; for each set bit i in the mask, reorder all 32 bits of the word by exchanging each bit with the bit at offset 2^i. Use mask 24 for endianness swap, use mask 31 for full bitwise reversal:
Load, Decrement, Push, Store
Name: DECLD Opcode: 0x8E
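One way to realise this operation is a sequence of masked exchange stages, one per mask bit. The sketch below assumes that stage i exchanges neighbouring bit groups of width 2^i; the function name and the table of masks are illustrative and not taken from the simulator.

```python
# Stage i exchanges neighbouring groups of 2**i bits.
STAGE_MASKS = [0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF]


def swap_bit_fields(word, mask):
    """Apply the exchange stages selected by the 5 bit mask to a 32 bit word."""
    word &= 0xFFFFFFFF
    for i, low in enumerate(STAGE_MASKS):
        if (mask >> i) & 1:
            shift = 1 << i
            # Move the upper group of each pair down and the lower group up.
            word = ((word >> shift) & low) | ((word & low) << shift)
    return word


print(hex(swap_bit_fields(0x12345678, 24)))  # byte order swap
print(hex(swap_bit_fields(0x00000001, 31)))  # full bit reversal
```

Mask 24 selects the stages for offsets 8 and 16, which together reverse the byte order; mask 31 selects all five stages.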
Pop an index, add it to the data pointer, load the word it points to, decrement it, push it, store it back:
Logarithm
Name: LOG2 Opcode: 0x8F
Determine position of highest bit set, -1 for zero word:
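A short sketch of this result, using Python's `int.bit_length`; the function name is hypothetical and the word is treated as an unsigned 32 bit value.

```python
def log2_word(word):
    """Return the index of the highest set bit, or -1 for a zero word."""
    return (word & 0xFFFFFFFF).bit_length() - 1


print(log2_word(0), log2_word(1), log2_word(0x80000000))
```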
Shift And Rotate Left
Name: LEFT Opcode: 0x90
Pop a bit count, pop a rotator word and a shifter word, shift the shifter by count bits to left, shifting in the bits from the rotator, which is rotated to the left by count bits, push the shifter result:
Shift And Rotate Right
Name: RIGHT Opcode: 0x91
Pop a bit count n, pop a rotator word and a shifter word, shift the shifter by n bits to right, shifting in the bits from the rotator, which is rotated to the right by n bits, push the shifter result:
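Read this way, LEFT and RIGHT behave like funnel shifts: the bits vacated in the shifter are filled with the bits that wrap around in the rotator. The sketch below follows that reading. It assumes that the bit count is taken modulo 32 and that only the shifter result is pushed; neither detail is confirmed by the text, and the function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def rotl(word, n):
    n &= 31
    return ((word << n) | (word >> (32 - n))) & MASK


def rotr(word, n):
    n &= 31
    return ((word >> n) | (word << (32 - n))) & MASK


def left(shifter, rotator, n):
    """Shift left by n, filling the low n bits from the rotated rotator."""
    n &= 31
    fill = rotl(rotator, n) & ((1 << n) - 1)      # wrapped-around low bits
    return ((shifter << n) & MASK) | fill


def right(shifter, rotator, n):
    """Shift right by n, filling the high n bits from the rotated rotator."""
    n &= 31
    fill = rotr(rotator, n) & ~(MASK >> n) & MASK  # wrapped-around high bits
    return ((shifter & MASK) >> n) | fill


print(hex(left(0x00000001, 0x80000000, 1)))
print(hex(right(0x80000000, 0x00000001, 1)))
```

Under these assumptions, passing 0 as the rotator gives a plain logical shift, and passing the same word as shifter and rotator gives a plain rotation.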
Sign Extend
Name: SIGN Opcode: 0x92
Pop a word, push true if it is negative, false otherwise:
Zero
Name: ZERO Opcode: 0x93
Pop a word, push true if it is zero, false otherwise:
Unconditional Jump
Name: UJP Opcode: 0x94
Pop an offset, add it to the current instruction pointer:
a ← m(sp++)
ip ← ip + a
False Jump
Name: FJP Opcode: 0x95
Pop an offset and a condition, if condition is false, add the offset to the current instruction pointer:
Load Constant
Name: LDC Opcode: 0x96
Pop an index, add it to the constant pointer, push word it points to. However, if that pointer is out of range of the constant limits, then subtract the constant pool size (lc_1 - lc_0) and add the data pointer instead:
Load Word
Name: LD Opcode: 0x97
Pop an index, add it to the data pointer, push word it points to:
a ← m(sp++)
m(--sp) ← m(a + dp)
ip ← ip + 1
Store Word
Name: ST Opcode: 0x98
Pop an index, add it to the data pointer, pop a word and store it:
a ← m(sp++)
m(a + dp) ← m(sp++)
ip ← ip + 1
Count Bits
Name: COUNT Opcode: 0x99
Pop a word, count the number of bits set:

Start a new thread with given limits, ip, and exception destination port, set sp to ld_1, dp to ld_0 + 64, and cp to lc_0 + 64, push port id of new thread's control port:
Call Function
Name: CALL Opcode: 0x9D
Pop an offset, add it to the current instruction pointer, save return address to the stack:
Return From Function
Name: JUMP Opcode: 0x9E
Restore the instruction pointer from the stack:
a ← m(sp++)
ip ← a + lc_0
Store To Stack
Name: STX Opcode: 0x9F
Pop an index, pop a word, add the index to the stack pointer, store word:
Load, Push, Increment, Store
Name: LDINC Opcode: 0xA0
Pop an index, add it to the data pointer, load the word it points to, push it, increment it, store it back:
a ← m(sp++)
b ← m(a + dp)
m(--sp) ← b
b ← b + 1
m(a + dp) ← b
ip ← ip + 1
Get Port Destination
Name: GETPORT Opcode: 0xA1
Pop local port id, push local port's destination:
Set Port Destination
Name: SETPORT Opcode: 0xA2
Pop local port id and remote port id, set local port's destination:
Output Word
Name: OUT Opcode: 0xA3
Pop local port id and a data word, send data into port; may block:
Output End Token
Name: OUTEND Opcode: 0xA4
Pop local port id, send an END token into port; may block:
Output Pause Token
Name: OUTPAUSE Opcode: 0xA5
Pop local port id, send a PAUSE token into port; may block:
Input Word
Name: IN Opcode: 0xA6
Pop local port id, receive a data word from port, push it; may block:
Check Input Port
Name: INMORE Opcode: 0xA7
Pop local port id, check whether end token was received, discard end token and push zero if so, push non-zero otherwise; may block:
Clear Event List
Name: EVCLEAR Opcode: 0xA8
Clear the event list for the current thread:
ev_{*,*} ← ε
ip ← ip + 1
Set Output Event
Name: EVOUT Opcode: 0xA9
Pop local port id, add event handle for output availability, pop an offset, add it to the current instruction pointer, use as event vector:
Set Input Event
Name: EVIN Opcode: 0xAA
Pop local port id, add event handle for input availability, pop an offset, add it to the current instruction pointer, use as event vector:
Set Input End Event
Name: EVEND Opcode: 0xAB
Pop local port id, add event handle for input end token availability, pop an offset, add it to the current instruction pointer, use as event vector:
Wait
Name: WAIT Opcode: 0xAC
Wait for any of the configured events. As soon as any one occurs, load the corresponding event vector as new instruction pointer; may block:
if out_i = ε and ev_{a,out} then ip ← ev_{a,out}
or if in_i = END and ev_{a,end} then ip ← ev_{a,end}
else if in_i = ε and ev_{a,in} then ip ← ev_{a,in}
else wait
Current Time
Name: NOW Opcode: 0xAD
Push the current time counter:
Wait With Timeout
Name: WAITTMO Opcode: 0xAE
Pop a time, wait for any of the configured events. As soon as any one occurs, load the corresponding event vector as new instruction pointer. When no event occurs until the specified time is reached, continue instruction execution without branching; may block:
a ← m(sp++)
if time - a ≥ 0 then ip ← ip + 1
or if out_i = ε and ev_{a,out} then ip ← ev_{a,out}
or if in_i = END and ev_{a,end} then ip ← ev_{a,end}
else if in_i = ε and ev_{a,in} then ip ← ev_{a,in}
else wait
Increment Stack Pointer
Name: POPN Opcode: 0xAF
Pop a summand, add it to the stack pointer:
Compare Unsigned
Name: ULESS Opcode: 0xB0
Pop a subtrahend, pop a minuend, subtract the subtrahend from the minuend, push true if negative, false otherwise, all unsigned:
Compare Signed
Name: SLESS Opcode: 0xB1
Pop a subtrahend, pop a minuend, subtract the subtrahend from the minuend, push true if negative, false otherwise, all signed:
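The difference between the two comparisons shows up when a word has its top bit set. The sketch below assumes that the sign of the exact (not wrapped) difference decides the result, so that both instructions act as a "less than" test on the unsigned or signed interpretation of the words; the function names are hypothetical.

```python
MASK = 0xFFFFFFFF
TRUE, FALSE = MASK, 0   # -1 and 0 as 32 bit words


def to_signed(word):
    """Interpret a 32 bit word as a two's complement integer."""
    word &= MASK
    return word - 0x100000000 if word & 0x80000000 else word


def uless(minuend, subtrahend):
    return TRUE if (minuend & MASK) < (subtrahend & MASK) else FALSE


def sless(minuend, subtrahend):
    return TRUE if to_signed(minuend) < to_signed(subtrahend) else FALSE


# 0xFFFFFFFF is the largest unsigned word, but -1 when read as signed.
print(hex(uless(1, 0xFFFFFFFF)), hex(sless(1, 0xFFFFFFFF)))
```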
Combined Number
Name: COMBINE Opcode: 0xB2
Pop a word, pop another word, multiply the latter by 192, push the sum:
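The factor 192 matches the number of immediate constant opcodes (0x00..0x7F and 0xC0..0xFF), so COMBINE can presumably be used to build larger constants digit by digit in base 192. The sketch below shows how a code generator might do this. The `encode_constant` and `run` helpers are hypothetical, and 32 bit wrap-around is not modelled.

```python
COMBINE = 0xB2


def encode_constant(n):
    """Return opcodes that leave n on the stack, using immediates and COMBINE."""
    low = (n + 64) % 192 - 64          # base 192 digit in the range -64..127
    high = (n - low) // 192
    if high == 0:
        return [low & 0xFF]            # fits a single immediate opcode
    return encode_constant(high) + [low & 0xFF, COMBINE]


def run(opcodes):
    """Tiny evaluator that understands only immediates and COMBINE."""
    stack = []
    for op in opcodes:
        if op == COMBINE:
            a = stack.pop()            # popped first
            b = stack.pop()            # "the latter", multiplied by 192
            stack.append(b * 192 + a)
        elif 0x80 <= op <= 0xBF:
            raise ValueError("instruction not modelled in this sketch")
        else:
            stack.append(op - 0x100 if op & 0x80 else op)
    return stack[-1]


code = encode_constant(1000)
print([hex(op) for op in code], run(code))
```

For 1000 this yields the immediates 5 and 40 followed by COMBINE, since 5 · 192 + 40 = 1000.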
Calculate Global Port Number
Name: PORT Opcode: 0xB3
Pop a word, calculate the global port id, push it:
Calculate Stack Pointer
Name: LDAX Opcode: 0xB4
Pop an index, add it to the stack pointer, push the data pool index for the word it points to:
a ← m(sp++)
a' ← a + sp - dp
m(--sp) ← a'
ip ← ip + 1
Available Threads
Name: THREADS Opcode: 0xB5
Determine the number of threads available to start:
Cycles Per Thread
Name: THRCYC Opcode: 0xB6
Push the instruction cycle counter of the thread:
Total Cycles
Name: CYCLES Opcode: 0xB7
Push the total instruction cycle counter of all threads of the processing unit:
Implementation
There is no hardware implementation of the Null Operand Parallel processor, only a software simulation. As this simulator is to be taken as a proof of concept only, optimisation efforts have been restricted to the bare minimum.
The simulator may be given a number of command line parameters to influence its detailed behaviour. In addition to a number of possible options, it is given a socket number from which it derives a set of sockets, one for each of the four external links. Furthermore, it may be given up to four socket numbers of other simulator instances to which it shall connect:
For a first compiler to support code generation, see [2016os]. However, there is no reason not to implement further compilers for other languages.
Options to the simulator are:
-i --init file
    initial code to read, first five words skipped
-f --file file
    open file on peripheral link, starting at the third, i.e. at internal peripheral link #2. Link #0 is always connected to standard input, link #1 is always connected to standard output
-t --trace=mask
    full trace, optionally restricted to threads: For the 32 bit mask, each single bit represents a thread, starting at processing unit #0, thread #0, processing unit #0, thread #1, and so forth, up to processing unit #3, thread #7. For each bit set, a full trace for the corresponding thread is sent to stderr. The mask may be omitted to ask for a trace of all threads
-x --extern=mask
    external links trace, optionally for selected links. Bits 0 to 3 represent the external links, bits 4 to 11 represent peripheral lines, bit 12 represents the router configuration block
-l --intern=mask
    internal links trace, optionally restricted to threads. The mask is to be used as with --trace
-d --debug
    enable break instruction and debug mode
-h --help
    show help and exit
Initially, the simulator will open sockets first-own-socket to first-own-socket+3. For the first n among these, with n the number of connect-to-socket parameters, it will attempt to connect to the latter sockets. For the remaining 4 − n sockets, i.e. first-own-socket+n to first-own-socket+3, it will wait for another simulator to connect to it. Only when all four sockets are connected will the simulator start executing opcodes.
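The layout of the trace mask and the roles of the four sockets can be sketched as follows. Both helpers are hypothetical and not part of the simulator; the sketch assumes that processing unit #0, thread #0 corresponds to the least significant bit of the mask, which the option description does not state explicitly.

```python
UNITS = 4              # processing units #0..#3, per the option text
THREADS_PER_UNIT = 8   # threads #0..#7, per the option text


def trace_mask(threads):
    """Build a 32 bit mask from (unit, thread) pairs; bit 0 = unit 0, thread 0."""
    mask = 0
    for unit, thread in threads:
        if not (0 <= unit < UNITS and 0 <= thread < THREADS_PER_UNIT):
            raise ValueError("no such thread")
        mask |= 1 << (unit * THREADS_PER_UNIT + thread)
    return mask


def socket_plan(first_own_socket, connect_to):
    """Return (own socket, role, peer socket) for each of the four links."""
    if len(connect_to) > 4:
        raise ValueError("at most four connect-to-socket parameters")
    plan = []
    for i in range(4):
        own = first_own_socket + i
        if i < len(connect_to):
            plan.append((own, "connect", connect_to[i]))   # first n sockets
        else:
            plan.append((own, "listen", None))             # remaining 4 - n
    return plan


print(hex(trace_mask([(0, 0), (0, 1), (3, 7)])))
print(socket_plan(5000, [6000, 6004]))
```

The socket numbers 5000, 6000 and 6004 are arbitrary example values.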
As a special case, a stand-alone system may be simulated by omitting all parameters.
Each of the processing units will run code from a simulated boot ROM at address 0x3fc0, which contains the following instructions intended to read some initial code from its first internal port (see option --init): 
Discussion
The main purpose of the Null Operand Parallel processor is to allow for a prototypical implementation of a dynamic software system as a proof of concept, and its design is restricted to features that are needed to serve this purpose. For instruction encoding, a single-operation, stack-oriented approach without registers (occasionally referred to as bytecode) has been chosen, to achieve compact encoding and to keep code generation as simple as possible, reducing the effort of compiler implementation. Clearly there are disadvantages, e.g. code execution is slower than with an instruction set that encodes multiple registers, because extra instructions are needed to fetch and store operands. Other stack-oriented designs have been proposed that implement multiple operations per instruction, with the overall number of instructions reduced so as to allow for a more compact implementation in hardware (e.g. [2010jb]).
Program flow exceptions have been replaced by exception messages, but these need to be reduced to a minimum, as a process should handle as many exceptional states as possible on its own. Exceptional states may be divided into two classes. The first class comprises exceptional states that arise due to wrong encoding, e.g. illegal instructions, unavailable resources, stack overflow and other range check faults. These exceptions need external handling, so sending a message to some responsible process is inevitable. The second class comprises exceptional states that arise due to unexpected input data, e.g. division by zero, and sudden reception of an end token where a message was not expected to end. These exceptions should all be handled by the process itself. For sudden reception of an end token, the instruction encoding could be chosen to force the process to evaluate the type of the token received, e.g. by extending the input instruction to return an extra boolean that indicates the token type. For division by zero, a similar approach could be taken, e.g. by extending the division instruction to return an extra boolean that indicates successful operation, or by simply redefining the division operation to result in a defined value whenever the denominator is zero. The latter still requires the process to do some extra check on the denominator first, but this is not worse than the traditional exception based handling.
Directly connected to exception handling is the question of how to respond to processes dying prematurely, which leaves ports and thus channels no longer available. For channels used for outgoing messages, any pending message is finalised with an end token, but for channels used for incoming messages, the process sending a message would need to be notified that the destination has vanished.
Support for floating point calculation has not been implemented, as it is not needed for basic operating system design.
True general purpose I/O lines are not provided, because a hardware implementation is not planned for this first design. Other designs show that smooth integration of such lines into a channel-based design is possible (e.g. [2009dm]).
