We propose a hybrid rSoC parallel processing architecture consisting of a central 32-bit RISC microprocessor interconnected to an array of 8-bit microcontrollers as coprocessing nodes. The central processor runs an embedded Linux operating system, with the coprocessor nodes mapped into a virtual file system, by which they can be controlled and reprogrammed.
Introduction
Performance predictability is one of the strongest constraints in real-time system design. Unlike in transformational computing (e.g. data processing and simulation systems), the feasibility of a real-time system is measured by its ability to guarantee that computational tasks will be completed within a certain deadline. The difficulty of assuring real-time performance increases dramatically with the number of interfering tasks [1].
Reconfigurable system-on-chip (rSoC) is a powerful tool for real-time system design. Custom architectures that offload processing from the central microprocessor onto customised hardware or coprocessing units can result in more predictable overall system performance.
In this work we propose a hybrid rSoC multiprocessing architecture consisting of multiple individually programmable microcontrollers interfaced to a central microprocessor system, and consider the subsequent impact on system performance and design flexibility.
The hardware consists of a 32-bit softcore CPU, connected in a star topology to multiple dynamically reprogrammable microcontroller cores. We have implemented this architecture using the Xilinx Microblaze processor and its little cousin Picoblaze, as the main processor and coprocessor respectively. We briefly introduce each of these in Section 2.
The software architecture consists of the uClinux operating system (e.g. [2]) running on the Microblaze, with the programmable state machines mapped directly into the kernel space, and made available to user processes via a virtual file system mapping. We use the term picoware to refer to code running on the coprocessor nodes.
We suggest two specific roles for the architecture.
Firstly, with an off-chip interface provided from each coprocessor node, the nodes may be used as intelligent IO processors, offloading main processor load and acting as virtual devices. Low bit-rate communications such as RS232, I2C, SPI, or even low-speed USB can be implemented on the coprocessors. PWM or sigma-delta modulation is another potential use.
The second use-class is for parallelising computational tasks characterised by small data transfers and long computational times. In these cases, the computations can execute in parallel on separate coprocessor nodes. Small block-based cryptographic algorithms are one such computation.
It can be argued that these IO and coprocessing tasks should be implemented more efficiently and compactly in customised hardware, rather than through general purpose programmable coprocessors. We address this point in Section 5, and present a theoretical analysis of the architecture, finally followed by conclusions and directions for future development.
Background
Conceptually the architecture is not tied to any particular processor or device; however, its implementation as described in this paper has been tested with Xilinx FPGAs and soft-processor cores. In the following we provide supporting background information.
Microblaze
Microblaze is a compact 32-bit RISC processor, with 32 general purpose registers and an orthogonal instruction set. It uses a 3-stage instruction pipeline, with delayed branch capability for improved instruction throughput. Microblaze is specifically targeted to logic primitives of Xilinx FPGA devices.
Hardware Architecture
In the proposed architecture, Picoblaze coprocessor nodes are connected in a star topology to a central Microblaze, using the FSL links described previously.
In the following we describe the coprocessor nodes and interconnect architecture in detail.
Picoblaze Coprocessor Nodes
The architecture of the coprocessor node is illustrated in Figure 1, which serves as a reference for the subsequent discussion.
A controller listens at the input FSL interface of the node. A node control operation is indicated by the presence of the FSL control bit (Sect. 2.1.1). Otherwise, the lower 8 bits of the FSL word are pushed onto the Picoblaze's input FIFO as data for subsequent processing.
Currently supported node control operations are Picoblaze code space write, reset, and interrupt.
Code writes are achieved by the master (Microblaze) sending a control command to set a write-address register, which is then automatically incremented on each subsequent opcode write. This seek/sequential programming model efficiently supports both bulk code writes (reprogramming the entire code memory) and small changes such as updating individual data items encoded in the Picoblaze opcodes. It is advisable, although not mandatory, to hold the node in reset during code updates.
Two 8-bit x 16-deep FIFOs are interfaced onto the Picoblaze's IO space, one each for read and write. The FIFOs can be read and written by Picoblaze in either blocking or non-blocking modes, to mirror the FSL semantics (Sect. 2.1.1). A "Halt" signal was added to the standard Picoblaze core; when asserted, it forces an extended T-state. Halt is asserted by the node controller if the Picoblaze attempts a blocking read on an empty input FIFO, or a blocking write on a full output FIFO. The Picoblaze may test the FIFO status to determine whether an operation would block.
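The control path described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the command codes, the code-memory size, and the way a command is passed alongside the FSL word are hypothetical simplifications, since the actual control-word encoding is not given in this section. Only the 16-deep FIFO, the 8-bit data path, and the auto-incrementing write address come from the text.

```python
from collections import deque

FIFO_DEPTH = 16      # 16-deep FIFOs, as stated in the text
CODE_WORDS = 1024    # assumed code-memory size, for illustration only

# Hypothetical command codes; the real control-word layout is not given here.
CMD_SET_ADDR, CMD_WRITE_OPCODE, CMD_RESET, CMD_INTERRUPT = range(4)


class NodeController:
    """Toy model of the controller on a node's input FSL interface."""

    def __init__(self):
        self.code = [0] * CODE_WORDS
        self.write_addr = 0
        self.in_reset = False
        self.irq_pending = False
        self.input_fifo = deque()

    def fsl_write(self, word, control=False, cmd=None):
        """Accept one FSL word. Returns False if a data word found the FIFO full."""
        if not control:
            if len(self.input_fifo) >= FIFO_DEPTH:
                return False
            self.input_fifo.append(word & 0xFF)  # only the lower 8 bits are data
            return True
        if cmd == CMD_SET_ADDR:
            self.write_addr = word % CODE_WORDS
        elif cmd == CMD_WRITE_OPCODE:
            self.code[self.write_addr] = word
            self.write_addr = (self.write_addr + 1) % CODE_WORDS  # auto-increment
        elif cmd == CMD_RESET:
            self.in_reset = bool(word)
        elif cmd == CMD_INTERRUPT:
            self.irq_pending = True
        return True


def program_node(node, opcodes, start=0):
    """Bulk code write: hold the node in reset, seek once, then write sequentially."""
    node.fsl_write(1, control=True, cmd=CMD_RESET)
    node.fsl_write(start, control=True, cmd=CMD_SET_ADDR)
    for op in opcodes:
        node.fsl_write(op, control=True, cmd=CMD_WRITE_OPCODE)
    node.fsl_write(0, control=True, cmd=CMD_RESET)
```

Patching a single opcode uses the same two steps as a bulk write (one address set, one opcode write), which is why one mechanism serves both cases.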
The introduction of the FIFOs and optional blocking/non-blocking operations to the processor nodes supports elegant solutions to host-coprocessor synchronization issues. Examples are presented later in the paper.
A final feature of each coprocessor node is an 8-bit direction-programmable GPIO port. The GPIO would typically be mapped to external pins of the FPGA; however, it may also be connected to other logic in the design if required. The GPIO inputs ...
Microblaze and coprocessor node interconnection
... through control operations as described previously, or as regular data operations. Data writes to a master port connect to the node input FIFO, while reads access the node output FIFO.
Master/slave symmetry between read and write operations is preserved: just as a node may optionally block, so too can Microblaze, with the blocking get/put operations described in Sect. 2.1.1.
In practice, it is highly inadvisable for Microblaze to perform blocking operations, since a stalled coprocessor could deadlock the entire system. This is particularly the case when the Microblaze is running a multitasking operating system like uClinux. The uClinux integration of the processor architecture forbids instruction-level blocking. Instead, user processes may elect to block, but at the device driver level the FIFO status is polled periodically, rather than utilizing hardware blocking get/put operations.
Operating System Integration
The hardware architecture just described can be used directly by low level Microblaze software. The procedures to program, control, and communicate with each node are simply combinations of FSL put and get instructions. Table 2 outlines the primitives of the library developed to support low level coprocessor node communications and control.
This API could also be used directly by userspace uClinux programs. However, we chose to map the architecture into the uClinux kernel, representing each Picoblaze as a system resource that may be acquired, read, and written. We expand on this mapping below, before offering justification for this approach over a potentially lighter-weight direct programming model.
The Linux Virtual File System
Like most modern operating systems, Linux uses a virtual file system (VFS) abstraction model. ... via NFS, and processing partial bitstreams as though they were regular local files. VFS also gives rise to truly virtual file systems which have no underlying physical manifestation but instead are constructed dynamically in response to process requests such as reads, writes, directory listings and so on. The best known use of this capability is the Linux proc file system (procfs). Normally mounted into the /proc directory, procfs provides a window into the internal operations of the kernel, allowing processes to inspect and in some cases modify the operation of the kernel.
Coprocessor array mapping into procfs
In accordance with the interpretation of the coprocessor nodes as configurable system resources, we choose to map these nodes directly into the procfs. This is achieved by writing a simple file-like interface wrapper around the low-level API shown in Table 2. The procfs interface is implemented as a loadable kernel module, which creates the virtual directory and file structure upon initialization.
data: a Linux character device node. Write operations place data into the Picoblaze's input FIFO, and reads extract data from the output FIFO. This device is discussed in greater detail below.
Before expanding on the details of this structure, it is illuminating to point out some implications of this mapping, using some simple examples.
Programming a node
A coprocessor node is programmed simply by writing binary opcode data into its /code virtual file:
$ cat pulseqm.hex > /proc/picoblaze/pico0/code
Copying a coprocessor process
A coprocessor "thread" executing on one node may be duplicated onto another node:
$ cp /proc/picoblaze/pico0/code /proc/picoblaze/pico1/code
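Because the nodes appear as ordinary files, the same two operations can be scripted from any language with file IO. The following is a minimal sketch, assuming the /proc/picoblaze/picoN/code layout used in the shell examples above and assuming that the code file can be read back (which the copy example implies). The function names and the `root` parameter are hypothetical, introduced only so the sketch can be exercised against a scratch directory.

```python
import shutil
from pathlib import Path

PROC_ROOT = Path("/proc/picoblaze")  # directory layout taken from the shell examples


def code_path(node, root=PROC_ROOT):
    """Path of the code virtual file for coprocessor node number `node`."""
    return Path(root) / f"pico{node}" / "code"


def program_node(node, opcode_file, root=PROC_ROOT):
    """Equivalent of `cat opcode_file > .../picoN/code`. Returns bytes written."""
    data = Path(opcode_file).read_bytes()
    with open(code_path(node, root), "wb") as f:
        f.write(data)
    return len(data)


def copy_node(src, dst, root=PROC_ROOT):
    """Equivalent of `cp .../picoSRC/code .../picoDST/code`."""
    with open(code_path(src, root), "rb") as fin, \
            open(code_path(dst, root), "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

No Picoblaze-specific library is involved. Everything goes through standard file reads and writes, which is the main attraction of the procfs mapping.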

The FIFO Device Driver
The /proc/picoblaze/picoN/data device is a regular Linux character device node that implements kernel-level IO buffering. This is in addition to the hardware buffering provided by the 16-deep input and output FIFOs on each processor node. In its present incarnation the Picoblaze node is not able to generate Microblaze interrupts, thus kernel tasklets are used to poll the physical FIFO status and transfer data between kernel buffers and the hardware FIFOs.
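The polling scheme can be sketched as a user-space simulation. This is not the driver itself: the class and function names are hypothetical, the buffers are plain Python deques, and the node side is left to the caller. It only illustrates the idea of a non-blocking pass that moves whatever fits between the kernel buffers and the 16-deep hardware FIFOs.

```python
from collections import deque

HW_FIFO_DEPTH = 16  # hardware FIFO depth stated in the text


class NodeChannel:
    """Kernel-side buffers plus the hardware FIFOs of one coprocessor node."""

    def __init__(self):
        self.kernel_tx = deque()  # written by user processes, not yet in hardware
        self.kernel_rx = deque()  # drained from hardware, not yet read by a process
        self.hw_in = deque()      # node input FIFO (host -> node)
        self.hw_out = deque()     # node output FIFO (node -> host)

    def poll(self):
        """One polling pass: move as much as fits, never block. Returns bytes moved."""
        moved = 0
        while self.kernel_tx and len(self.hw_in) < HW_FIFO_DEPTH:
            self.hw_in.append(self.kernel_tx.popleft())
            moved += 1
        while self.hw_out:
            self.kernel_rx.append(self.hw_out.popleft())
            moved += 1
        return moved


def write(chan, data):
    """User-process write: lands in the kernel buffer and returns immediately."""
    chan.kernel_tx.extend(data)
    return len(data)


def read(chan, count):
    """User-process read: returns up to `count` bytes already buffered."""
    n = min(count, len(chan.kernel_rx))
    return bytes(chan.kernel_rx.popleft() for _ in range(n))
```

A process can hand over a large block in a single `write`, and the polling pass then feeds the hardware FIFO in 16-byte portions without the process being involved again.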
The reasoning behind the additional kernel layer of buffering is the same as for more conventional system devices such as disk units: buffered IO can reduce the overhead of excessive process-kernel interaction by allowing larger chunks of data to be transferred between processes and the kernel at a single time. Maintaining separate buffers for each node also eases the problem of scheduling reads and writes across multiple nodes.
If the coprocessor nodes were to be accessed directly (the light-weight approach mentioned at the beginning of this section), rather than the kernel buffering model, significant custom software would be required in order to efficiently schedule IO operations to the multiple nodes.
The naive approach (polling each device in a round robin fashion, each time reading as much data as was available, and writing as much data as could fit) would be very inefficient, particularly if the data transmission rates varied across the nodes.
We can easily trade between software and hardware buffering at compile/synthesis time. The hardware FIFOs are implemented in the Virtex-II SRL16 primitive, and can easily be cascaded together to provide arbitrary depths (up to the capacity of the chip). Larger hardware buffers would provide a measurable, although asymptotically limited, performance improvement, by increasing the amount of data that can be transferred from the kernel to the hardware on each transfer cycle, and thus reducing task switching (or, potentially, interrupt handling) overhead.
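The "asymptotically limited" improvement can be made concrete with a simple cost model. This is a sketch under assumptions that are not taken from the paper: each transfer cycle pays a fixed overhead, each byte pays a fixed per-byte cost, and every cycle moves a full FIFO of data. The function name and the example figures are hypothetical.

```python
import math


def transfer_time(total_bytes, fifo_depth, t_overhead, t_byte):
    """Total time to move `total_bytes` through a FIFO of `fifo_depth` entries.

    Assumed model: one fixed overhead (task switch or similar) per transfer
    cycle, plus a fixed cost per byte. Time units are whatever the caller uses.
    """
    cycles = math.ceil(total_bytes / fifo_depth)
    return cycles * t_overhead + total_bytes * t_byte


# Illustrative figures only, in nanoseconds: 10 us overhead, 100 ns per byte.
for depth in (16, 64, 1024):
    print(depth, transfer_time(1024, depth, 10_000, 100))
```

Deeper FIFOs shrink only the overhead term. The per-byte term is unaffected, so the total can approach but never fall below `total_bytes * t_byte`.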
Analysis and discussion
We present an analysis of the architecture, first by outlining some of its useful characteristics, from which we propose two classes of application that can most benefit.
Useful characteristics
The following considers some of the characteristics of the architecture and its implementation that are useful in the context of real-time embedded systems.
Improved predictability
As mentioned previously, one of the greatest challenges facing real-time system designers is the ability to guarantee computational deadlines. For conventional microprocessor systems, this commonly requires significant over-engineering of the central processor, to cover a rare conjunction of events (the worst case scenario).
The implementation described in this paper uses Picoblaze coprocessors clocked at 66 MHz, executing a consistent 33 million instructions per second, irrespective of what the central processor or other coprocessor nodes are doing.
Migrating real-time tasks onto these coprocessor nodes has two positive side-effects:
Task execution predictability is improved
Central CPU load is reduced
Communication between central and coprocessor nodes must still be considered, and ideally should be minimised to take maximum advantage of the architecture.
Logic reuse
Once deployed, an embedded system may require only very infrequent use of some external communication peripheral, such as a serial interface, for in-the-field testing or maintenance. In cost and power sensitive applications it is wasteful to dedicate logic area to a peripheral device that may only be used with a duty cycle of perhaps 1% or potentially much less.
FPGA power consumption is strongly influenced by static leakage currents, which are present even if a particular logic module is not actively switching.
The only way to achieve significant power reduction is to fit the application into a smaller device. The coprocessor node(s) of rarely used IO functions implemented as picoware can be executing other useful tasks when not required for the infrequent IO operations. This is a form of run-time logic reuse: the logic cost of the coprocessor is amortised over the entire application execution time.
Dynamic and/or partial reconfiguration is commonly argued as a means of achieving run-time logic reuse. While this is true in principle, in practice the technique is inadequately supported by design tools and FPGA devices. Significant extra design effort is introduced to meet the floorplanning and inter-module signal routing requirements of partial-reconfiguration support. The network of programmable coprocessors may not have the same performance or flexibility as customised hardware; however, it is dramatically simpler to use.
Depending upon the degree of dynamic reconfiguration, the functional switching time of the reprogrammable architecture may also be faster. It is certainly of finer granularity, with individual opcodes in a coprocessor program able to be modified.
Roles for the architecture
Clearly, an 8-bit Picoblaze will not outperform a dedicated 32-bit Microblaze at the same clock frequency. Meaningful roles for the coprocessor network architecture will leverage the available parallelism, and the subsequent predictability improvements resulting from decreased load on the central processor.
We propose two roles for the architecture. The first is for intelligent virtual IO devices, and the second is for parallel computational coprocessors. We discuss each below, and consider specific constraints that influence the feasibility of such an approach.
Intelligent peripherals
The consistent instruction throughput of the Picoblaze and its relatively high performance (in excess of 50 8-bit MIPS) make it suitable for a range of IO tasks. The fixed instruction throughput permits software timing loops that would be either impossible or wasteful on the central processor.
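As a rough illustration of why a fixed instruction rate makes software timing practical, the following sketch computes how many instructions fit into one bit period of a serial link. It uses the 66 MHz clock and 33 million instructions per second quoted earlier for this implementation, which together imply two clock cycles per instruction. The function name and the choice of baud rates are assumptions made for the example.

```python
CLOCK_HZ = 66_000_000        # node clock quoted for this implementation
CYCLES_PER_INSTRUCTION = 2   # implied by 33 MIPS at 66 MHz


def instructions_per_bit(baud, clock_hz=CLOCK_HZ, cpi=CYCLES_PER_INSTRUCTION):
    """Return (whole instructions per bit period, relative timing error)."""
    exact = clock_hz / (cpi * baud)
    count = round(exact)
    return count, (count - exact) / exact


for baud in (19_200, 115_200):
    count, err = instructions_per_bit(baud)
    print(f"{baud} baud: {count} instructions per bit, error {err:+.3%}")
```

Because every instruction takes the same time, a delay loop sized from this count gives a repeatable bit period, with only the rounding error shown.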
Coprocessing
In the role of a parallel coprocessor, the most restrictive constraint is communications bandwidth. If Microblaze takes longer to communicate data to the coprocessor than to perform the actual computations, then no performance improvement is gained. However, it may still make sense to offload the processing, if predictability is improved. A candidate for this role is low rate, high complexity encryption. Picoblaze has been demonstrated performing AES encryption [14], and although at low speeds, that encryption would come at effectively zero cost to the central processor.
Theoretical analysis
In this section we present a simple mathematical model and analysis of the architecture. Certain reasonable assumptions are necessary; the intention is to establish identities and inequalities that describe the range of useful applications.
We assume that computational tasks are discretised such that each task involves communication of, and operations on, $n$ individual data items, and that there are $N$ such tasks to be computed. Let $T_c(n) = T_o + nT_d$ be the time of communicating $n$ data items between the central processor and a single node, where $T_o$ is the fixed overhead (e.g. system call entry), and $T_d$ is the time to transmit one data item. Next, let $T_{PB}(n)$ be the time for a single Picoblaze node to perform a computation on those $n$ data items, and let $T_{MB}(n)$ be the time for a Microblaze to perform the same computation, also on $n$ data items.
For simplicity we assume that Microblaze and Picoblaze implementations use algorithms with the same underlying complexity $O(f(n))$. Then, let $K = T_{PB}(n)/T_{MB}(n)$ be the sequential speedup of the Microblaze over Picoblaze for the specific algorithm (e.g. due to word length differences and other processor architecture benefits).
Finally, let $P$ be the number of coprocessors available, and we assume that this is equal to $N$, the number of tasks. That is, we do not consider the problem of scheduling $N$ tasks onto $P$ processors if $N \neq P$.
Figure 3 illustrates the processing sequence of a central Microblaze communicating with $N=3$ nodes. The Microblaze spends time $T_c(n)$ communicating data to each node, each of which takes $T_{PB}(n)$ to complete. Finally, the communication is repeated to return data to the Microblaze. In this model, the nodes are being used as transformational coprocessors, producing as much data as they are fed. This is one of the worst-case scenarios for the ...
Compute-bound:
$$N(T_o + nT_d) \ll T_{PB}(n) \tag{6}$$
Mixed: neither dominates.
We consider two metrics. The first is the speedup $S$, the ratio between the parallel and sequential processing times. It represents the overall speedup achieved by a parallel implementation:
For compute-bound systems (6), this simplifies to
$$S \approx N \cdot T_{MB}(n)/T_{PB}(n) = N/K \tag{8}$$
Thus, for compute-bound systems, the proposed architecture will result in an overall speedup ($S > 1$) if the number of coprocessor nodes $N$ exceeds the sequential speedup of a Microblaze implementation over a Picoblaze implementation ($K$). Since $N$ is bounded, the scalability of this speedup is limited.
For communication-bound systems the speedup approximates as
$$S \approx \frac{T_{PB}(n)}{K\,(T_o + nT_d)} \tag{9}$$
However, communication-bound implies Eqn. (5), and thus we have
$$S \ll 1 \tag{10}$$
This result is unsurprising: a communication-bound system spends too much time moving data around, and not enough time actually performing work on that data.
The theoretical analysis shows that migrating heavily compute-bound functions onto coprocessor nodes results in significant offloading of the central processor, and confirms that communication-bound processes are to be avoided.
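The two limiting cases can be explored numerically with a small model. This is a sketch under an explicit assumption: the general expression for the speedup is not reproduced in this section, so the code takes the parallel time to be N times the per-node communication time plus one Picoblaze compute time, which is one form consistent with the compute-bound and communication-bound approximations above. The function names and the factor of ten used to stand in for "much less than" are hypothetical.

```python
def comm_time(n, t_o, t_d):
    """Per-node communication time: fixed overhead plus n data items."""
    return t_o + n * t_d


def speedup(num_nodes, n, t_o, t_d, t_pb, k):
    """Speedup of the coprocessor array over a sequential Microblaze.

    Assumed model (not taken verbatim from the paper):
      sequential time = N * T_MB(n), with T_MB(n) = T_PB(n) / K
      parallel time   = N * (T_o + n * T_d) + T_PB(n)
    """
    t_mb = t_pb / k
    sequential = num_nodes * t_mb
    parallel = num_nodes * comm_time(n, t_o, t_d) + t_pb
    return sequential / parallel


def regime(num_nodes, n, t_o, t_d, t_pb, factor=10.0):
    """Classify a workload; `factor` is an arbitrary stand-in for '<<'."""
    comm = num_nodes * comm_time(n, t_o, t_d)
    if comm * factor <= t_pb:
        return "compute-bound"
    if comm >= t_pb * factor:
        return "communication-bound"
    return "mixed"
```

With negligible communication cost the model returns N/K, for example a speedup of 2 for N = 8 nodes and K = 4. As communication time grows relative to the Picoblaze compute time, the returned value drops below 1.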
Conclusions and Future Development
We have presented an architecture for parallel processing of computational and IO tasks on programmable microcontrollers linked to a central microprocessor in a star topology. We have further shown how this coprocessor array may be naturally and efficiently integrated into the Linux operating system executing on the central processor, and that such a mapping provides uniform and simple access to the coprocessor array.
Performance predictability: the impact on central CPU load is minimised by mapping real-time tasks onto coprocessor nodes.
Flexibility: a node with access to an external IO port may be used as a regular computational node when not required for IO duties. Coprocessing tasks not dependent on external node connectivity may be executed on any available node.
Simplified logic re-use: the simplified programming model offers some logic re-use advantages of partial reconfiguration without the excessive complexity.
The proposed architecture can just as easily be applied to systems with more powerful coprocessor nodes. Indeed, with increased logic usage the architecture could be inverted, placing a simple microcontroller at the heart of an array of powerful microprocessors.
Future developments in the work will include experiments with different types of coprocessors, including custom nodes with greater support for numeric processing such as DSP operations, as well as investigations to see how the coprocessors themselves may be used to accelerate specific operating system functionality. More broadly, we are investigating a wide array of single-chip multiprocessing architectures, and their mappings into the Linux operating system paradigm.
Acknowledgements
The support of the Australian Research Council is gratefully acknowledged. The authors would also like to thank Goran Bilski for advice on the implementation of the ideas presented in this paper.
