Dynamically reconfigurable systems adopt either a processor-controlled networked architecture or a sequencer-controlled data flow architecture. In the networked architecture, the processor is overloaded with data transfer requests, whereas in the data flow architecture, the burden is shifted completely from the processor to the data sequencer. As a tradeoff between these two extremes, this work proposes a novel module sequencer architecture, which not only allows the processor and the sequencer to share the heavy data communication load, but is also more coherent with the conventional processor-FPGA architecture. Further, the architecture is highly flexible because it can be tuned to fit a particular application. Application examples show that the proposed architecture is superior to the networked architecture in terms of lower communication load and to the data flow architecture in terms of reduced system complexity.
Introduction
As technology progresses, the FPGA has come to represent a tradeoff between performance and flexibility. Given the large amount of resources now available, Dynamically Partially Reconfigurable Systems (DPRS) can be implemented in a single FPGA [5]. Unlike von Neumann-based architectures, DPRS currently have no standard memory hierarchy or communication scheme. However, two communication architectures are commonly adopted, namely the processor-controlled network architecture (PNA) and the sequencer-controlled dataflow architecture (SDA). The main problem in PNA is that the processor is easily overloaded with too many communication requests. The challenge in SDA is that the high complexity of generating low-level data flow instructions makes optimization difficult, and thus it is not easy to achieve high communication performance.
As a tradeoff between the low communication performance of network architectures and the high complexity of data flow architectures, a novel module sequencer architecture (MSA) is proposed in this work. It addresses four issues: reduced communication overhead, a simplified programming model, a simplified bus architecture, and virtual function mapping.
This article is organized as follows. Section 2 discusses related research work and compares it with our architecture. The proposed module sequencer architecture is described in Section 3. Illustrative examples are given in Section 4. Finally, conclusions are given in Section 5.
Related Work
Instead of describing the generally well-known bus-based or NoC-based PNA, we focus on two typical SDAs in this section. The Transport Triggered Architecture (TTA) [1] [2] was proposed for customizing application-specific instruction-set processor (ASIP) designs. The TTA is static hardware with a simple design that moves the application complexity from the hardware to the software or the compiler. Reconfigurable Pipelined Datapaths (RaPiD) [3] [4] is a domain-specific coarse-grained reconfigurable architecture. RaPiD is a typical data flow architecture with a data sequencer.
Module Sequencer Architecture
Similar to other DPRS architectures, the target module sequencer architecture has a statically configured part and a dynamically reconfigurable part. As shown in Figure 1 
RMS Design
As illustrated in Figure 2, the RMS has nine components: three internal storages, four controllers, a bus state monitor, and an input decoder. The storages are a command pool (CP) that stores the chain commands, a data FIFO (DF) that caches the input data, and a slot table (ST) that records the state information for each slot. The state of a slot includes the mapping between its logical and physical IDs, its usage status, and, if the slot is configured, its execution status. The command pool controller (CPC) accesses the CP, stores chains into it, and selects an enabled data transfer request from some chain for execution. The memory con-
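The three storages can be pictured as plain data structures. The following Python sketch is only an illustration under stated assumptions: every class and field name (`SlotState`, `TransferRequest`, `Chain`, `RmsStorage`, `physical_slots`) is hypothetical, and the real RMS is hardware whose actual encoding is not given here. The sketch shows how a slot table entry can carry the logical-to-physical mapping together with the usage and execution status.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class SlotState:
    """One slot table (ST) entry; field names are illustrative assumptions."""
    physical_id: int
    logical_id: Optional[int] = None  # None means no function is configured
    in_use: bool = False              # usage status
    executing: bool = False           # execution status, meaningful only if configured


@dataclass
class TransferRequest:
    """A data transfer between two logical functions (hypothetical layout)."""
    sender: int
    receiver: int


@dataclass
class Chain:
    """A chain command as it might be held in the command pool (CP)."""
    priority: int
    requests: List[TransferRequest]


@dataclass
class RmsStorage:
    """The three storages: CP, DF, and ST (keyed by physical slot ID)."""
    command_pool: List[Chain] = field(default_factory=list)
    data_fifo: Deque[bytes] = field(default_factory=deque)
    slot_table: Dict[int, SlotState] = field(default_factory=dict)

    def physical_slots(self, logical_id: int) -> List[int]:
        """Return the physical IDs currently mapped to a logical function ID."""
        return [s.physical_id for s in self.slot_table.values()
                if s.logical_id == logical_id]
```

Keying the slot table by physical ID keeps one entry per slot, while `physical_slots` allows one logical function to map to several physical instances.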
RMS Control Flow
The interaction between the RMS and the processor is triggered when the processor sends one of three types of commands to the RMS. The ID mappings are stored in the ST by the SC, the chain commands are stored in the CP by the CPC, and the input data are stored in the DF by the MC. The CPC checks for data transfer requests in the CP and selects a request belonging to the chain with the highest priority. To execute a data transfer request, the CPC queries the SC to check whether the hardware of the requested function is configured in some slot and, if this is the first data transfer request of the chain, queries the MC to check whether its input data are available in the DF. If the responses from the SC and the MC are both positive, the CPC notifies the SC to assert the read and write control signals of the corresponding functions and notifies the MC to transfer the data. Otherwise, that is, if the requested hardware function or the input data of the selected chain is unavailable, execution is postponed and the CPC selects another request to execute.
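The selection policy described above can be sketched in a few lines of Python. This is a minimal software model, not the hardware implementation: the names `Chain`, `select_request`, `function_ready`, and `input_available` are hypothetical, the two callbacks merely stand in for the SC and MC queries, and the convention that a larger number means a higher priority is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Chain:
    priority: int          # assumption: larger value means higher priority
    requests: List[str]    # pending data transfer requests, in chain order
    next_index: int = 0    # index of the next request to execute

    def pending(self) -> bool:
        return self.next_index < len(self.requests)


def select_request(
    command_pool: List[Chain],
    function_ready: Callable[[str], bool],     # stands in for the SC query
    input_available: Callable[[Chain], bool],  # stands in for the MC query
) -> Optional[Tuple[Chain, str]]:
    """Pick the next request of the highest-priority chain that can run.

    A chain whose hardware function or input data is unavailable is
    postponed, and the next chain in priority order is tried instead.
    """
    candidates = sorted(
        (c for c in command_pool if c.pending()),
        key=lambda c: c.priority,
        reverse=True,
    )
    for chain in candidates:
        request = chain.requests[chain.next_index]
        if not function_ready(request):
            continue  # requested hardware function is not available
        # Only the first request of a chain needs input data from the DF.
        if chain.next_index == 0 and not input_available(chain):
            continue
        return chain, request
    return None  # nothing can execute right now
```

For instance, with a low-priority JPEG chain and a high-priority DES chain in the pool, the DES request is selected when its function is ready; if it is not, the CPC model falls back to the JPEG request rather than stalling.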
When the SC receives a function query signal from the CPC, it refers to the ST to check whether the corresponding functions are configured. If a function is configured and unused, the SC acknowledges that the requested function is ready. If it is not configured or all of its physical instances are busy, the SC notifies the microprocessor that the requested function is unavailable. When the SC receives a function execution signal, it asserts the write signal of the sender and the read signal of the receiver. If the sender is SW, the SC enables the tri-state buffer, as shown in Figure 2, so that the data of the chain sent by the MC can be transferred on the bus.
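The query and execution behaviour of the SC can be modelled as follows. The sketch is a simplified software analogy under stated assumptions: the `Slot` fields, the `SC` class, and its `query` and `execute` methods are hypothetical names, signals are modelled as booleans, and an unavailable function is reported simply by returning `False` to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional

SW = "SW"  # label for a sender that is software rather than a hardware function


@dataclass
class Slot:
    logical_id: Optional[str]   # None means the slot is not configured
    busy: bool = False          # usage status
    read_enable: bool = False   # read control signal (receiver side)
    write_enable: bool = False  # write control signal (sender side)


class SC:
    """Hypothetical model of the controller that consults the slot table."""

    def __init__(self, slot_table: List[Slot]) -> None:
        self.slot_table = slot_table
        self.tristate_enabled = False

    def _find_free(self, logical_id: str) -> Optional[Slot]:
        # A function is ready if some configured physical instance is unused.
        for slot in self.slot_table:
            if slot.logical_id == logical_id and not slot.busy:
                return slot
        return None

    def query(self, logical_id: str) -> bool:
        """Function query: True only if configured and not busy."""
        return self._find_free(logical_id) is not None

    def execute(self, sender: str, receiver: str) -> bool:
        """Function execution: assert write on the sender, read on the receiver."""
        dst = self._find_free(receiver)
        src = None if sender == SW else self._find_free(sender)
        if dst is None or (sender != SW and src is None):
            return False  # requested function unavailable
        if src is None:
            # Software sender: open the tri-state buffer so that data sent
            # by the MC can be driven onto the bus.
            self.tristate_enabled = True
        else:
            src.write_enable = True
            src.busy = True
        dst.read_enable = True
        dst.busy = True
        return True
```

A function that is configured but whose only instance is busy is reported as unavailable, which mirrors the case in which all physical instances are in use.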
When a data transfer is finished, the corresponding functions deassert the busy signal, indicating that the data bus is free for another transaction. The CPC then initiates the execution of another data transfer request.
The programming model for MSA follows a conventional one so that a user need not learn a new programming method. As shown in Figure 3, given a user program and the corresponding task profile information, the chain generator determines the hardware-software partition. The hardware-software task constructor reorganizes the user program by replacing selected loops with RMS driver system call invocations and with synchronization and buffering constructs. The result is a modified program called the chained program.
To support the execution of chained programs in MSA, an operating system for chained programs (CPOS) is required. Besides being an operating system for reconfigurable systems (OS4RS), CPOS has a system call that allows chained programs to send a request for executing a chain through the RMS driver. CPOS also has a hardware-software task scheduler, a hardware function block placer, a driver for the configuration controller, and other I/O device drivers. Note that the allocation, management, placement, and scheduling of reconfigurable hardware function blocks are all performed by CPOS, which means the RMS is only responsible for executing a chain request by coordinating the data transfers between blocks and between the processes and the blocks. The development of CPOS is ongoing and requires further design and implementation.
Performance Model
To evaluate the effectiveness of our proposed RMS architecture, we compare RMS with the processor-controlled network architecture. Since we are improving
We assume the context switch time to be negligible in the following evaluation. We compare four different architectures depending on the use of RMS and DMA. The number of cycles a processor must expend in handling data communication for the task is as follows.
$$(t_o + t_n) \times k + (n + 2)$$
In Table 2, we compare the total system execution time for five configurations: (1) a single JPEG chain, (2) a single DES chain, (3) JPEG and DES chained sequentially, (4) a low-priority JPEG chain running in parallel with a high-priority DES chain, and (5) a high-priority JPEG chain running in parallel with a low-priority DES chain. The data size for both JPEG and DES is 1,024 bytes. Comparing configurations (3), (4), and (5), we observe that configuration (3) has the worst performance because the functions are executed sequentially in a single chain. Compared with configuration (5) and all other configurations, configuration (4) gives the best performance. This is because the most time-consuming chain, here DES, is given the highest priority in the RMS.
