Abstract-This paper describes a systematic method to map synchronous digital systems onto dynamically reconfigurable programmable logic (i.e., programmable logic able to swap, in real time, the configuration defining the functionality of the system). The method is based on a temporal bipartitioning technique that separates the static implementation of a circuit into two temporally independent hardware contexts. As the experimental results show, the method is capable of improving the functional density of the dynamic implementation with respect to the static one.
I. INTRODUCTION
Custom computing machines (CCMs) can be used for many computationally intensive problems. Their architectural specialization is achieved at the expense of flexibility, because they are designed for one application and are inefficient in other application areas. Reconfigurable logic devices overcome this limitation: once a function has been completed, the hardware resources of the device can be reconfigured for a different one, achieving a high level of performance on a wide variety of computations. A step forward is run-time reconfiguration, or dynamic reconfiguration, where hardware resources can be reconfigured while the system is in normal execution, increasing the efficiency of the system.
A large number of applications have been proposed and developed that exploit the advantages of dynamically programmable logic devices. Among them we can cite reconfigurable processor units (RPUs), adaptive signal processing, self-repairing and evolvable hardware, artificial neural networks, and others. Another way to take advantage of dynamic reconfiguration capabilities is to partition a large computing design over a limited set of hardware resources, so that one partition is executed while another partition is being configured, improving the functional density of the implementation.
Although these new devices open a wide field of applications, the main limitation that has prevented their extensive use in industrial applications is the lack of design tools able to handle dynamic reconfiguration strategies automatically.
Traditional partitioning techniques (KL, KLFM, and variants [1]) are not suitable for effective temporal partitioning, because they do not take into account the temporal dependencies between functional blocks. Many of the temporal partitioning techniques can be categorized as high-level or operation-level partitioning [2]-[4]. They use scheduling techniques and assume that the reconfiguration overhead is large compared with the task graph execution time. The reconfiguration overhead includes the time required to load a new configuration from off-chip memory and the time to store and retrieve data from a memory for the buffering between partitions. The gate-level temporal partitioning techniques [5]-[9] assume a reconfiguration time of the same order as the task execution time, because multiple sets of configuration bits are stored on chip in the internal memory of the reconfigurable device (multicontext device). The algorithms in [5]-[7] use a scheduling technique for the partitioning, while those in [8] and [9] combine scheduling with a max-flow technique with bounded partitions. The technique in [5] models the circuit as a Mealy state machine and uses a dedicated hardware memory (micro-registers) for the buffering between contexts, while those in [6]-[9] use a time-multiplexed communicating logic (TMCL) model for the abstraction of the hardware resources needed for the buffering.
In this paper we introduce a gate-level bipartitioning technique that separates a static implementation into two temporally independent hardware partitions suitable for a multicontext device, using the standard registers of the device for the buffering. The presented algorithm uses a directed hypergraph and a partitioning technique that minimizes the number of buffer resources while balancing the partition sizes. The experimental results obtained on standard benchmark suites demonstrate that the proposed algorithm produces a final physical implementation with an improved functional density.
The paper is organized as follows. In Section II, we describe the internal organization of the dynamically reconfigurable FPGA (DRFPGA) of the field programmable system on a chip (FIPSOC) device, with special emphasis on its suitability for providing dynamic reconfiguration features. Section III is devoted to the detailed definition of spatial and temporal partitioning principles, their main characteristics, and the concepts of functional density, virtual circuit, micro-cycle, and macro-cycle. Section IV introduces a systematic technique to transform a synchronous digital circuit from its (static) implementation into a dynamic implementation. This technique is used as the basis of the proposed temporal bipartitioning algorithm. Section V presents the experimental results obtained with the algorithm on a set of MCNC benchmark circuits. Finally, the conclusions are outlined in Section VI.
II. THE FIPSOC ARCHITECTURE
The FIPSOC [10], [11] is a new family of system on chip (SOC) devices. The device includes an 8051 microcontroller core, a DRFPGA, a configurable analog block (CAB) with a programmable A/D and D/A acquisition/conversion section, RAM memory for the configuration bits and/or for general purpose user programs, and some additional peripherals. The building blocks of the DRFPGA are termed DMCs (digital macro cells). The DMC, whose internal structure is depicted in Fig. 1, is a large-granularity, 4-bit wide programmable cell that contains a combinational block and a sequential block interconnected through an internal router. Both the combinational and the sequential blocks have 4-bit outputs plus some extra outputs for macro modes. The DMC output block has four one-bit outputs, each individually multiplexing a combinational or a sequential output, plus two auxiliary outputs.
Both combinational and sequential blocks can be configured in static or dynamic mode. The dynamic mode provides two temporally independent contexts that can be swapped with a single signal. The contents of the registers are duplicated so that they can be stored when the active context is changed and restored when the context becomes active again. Each configuration bit is also duplicated, so as to provide dynamic reconfiguration capabilities. As depicted in Fig. 2, there is a physical isolation between the active configuration bits and the mapped memory, which can be used for storing the configuration of the two contexts, or as a general purpose RAM once its contents have been transferred to the active context. A new scheme for reconfiguring DMCs was introduced in the FIPSOC circuitry to enable a fast reconfiguration time. Each DMC has two inputs (IA3 for triggering the context swap and IB3 for indicating the context index) that can be driven by any DMC output, so that any output can trigger a new context swap, reducing the context swap time to just one clock cycle.
III. PARTITIONING AND DYNAMIC IMPLEMENTATION

A. Spatial and Temporal Partitioning
Partitioning techniques focus on large electronic designs, where the entire circuit functionality cannot be physically mapped onto just one device. These techniques physically divide the design into two (bipartitioning) or more (multiway partitioning) circuits, with the goal of minimizing the communication signals between partitions (cutsize) while balancing partition sizes within a given tolerance.
Most of the traditional partitioning algorithms (i.e., KL, KLFM, and their variants [1]) are spatial partitioning techniques. Each partition is placed in a separate physical area (spatial independence), shares the same time slot (temporal dependence), and is joined to the others through their common signals. The cutsize of the partitioning is defined in terms of the number of shared signals (cut-signals) between spatial partitions. The directions of the cut-signals are not taken into account.
A temporal partitioning algorithm places partitions in separate time slots (temporal independence) that share the same physical area (spatial dependence) and are joined through their common signals. The cutsize of the partitioning is defined in terms of the additional buffer resources needed to communicate shared signals (cut-signals) between temporal partitions. The directions of the cut-signals must be considered in order to preserve the principle of temporal independence between partitions. Fig. 3(a) presents a large synchronous digital design (V) to be partitioned using a spatial bipartitioning algorithm or a temporal bipartitioning algorithm. Its physical size is given in terms of resource area (A) and its temporal size is given in terms of the clock cycle (T), determined by the delay of the critical path.
A functional density parameter D was introduced in [12] to measure the overall advantage (or disadvantage) of a partitioned implementation over the static one. Functional density measures the computational throughput per unit of hardware circuit area and is defined as the inverse of the cost function, $D = 1/C$, traditionally measured as $C = A \cdot T$, where A is the area of the implementation and T is the time required for a computation. A better functional density means a more intensive use of hardware resources for a fixed time. Additionally, a parameter able to indicate the improvement in functional density is defined in [12] as the difference between the functional density of the partitioned implementation and that of the static implementation, in relative terms with respect to the static one. This factor is expressed as $I = (D_p - D_a)/D_a$, where $D_p$ is the functional density of the partitioned implementation and $D_a$ that of the static one. A partitioned implementation whose cost is higher than that of the static one has a lower functional density and a negative improvement factor ($I < 0$).
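As a minimal sketch of these two measures, the following Python fragment computes the functional density and the improvement factor. It assumes that the improvement factor is the relative difference (D_p - D_a)/D_a between the partitioned and the static functional densities, which is a reading of the definition in [12]; the function names and all numeric values are hypothetical and serve only as illustration.

```python
def functional_density(area: float, time: float) -> float:
    """Functional density D = 1 / C, with cost C = A * T."""
    if area <= 0 or time <= 0:
        raise ValueError("area and time must be positive")
    return 1.0 / (area * time)


def improvement_factor(d_partitioned: float, d_static: float) -> float:
    """Assumed form: I = (D_partitioned - D_static) / D_static."""
    return (d_partitioned - d_static) / d_static


# Hypothetical static design: 100 area units, 20 time units per computation.
d_static = functional_density(100, 20)

# Hypothetical partitioned design: half the area, slightly longer total time.
d_dynamic = functional_density(50, 25)

print(round(improvement_factor(d_dynamic, d_static), 3))
```

With half the area and an unchanged computation time, the same functions give an improvement factor of 1, and any partitioned design with a higher area-time product than the static one gives a negative value.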
An optimum temporal bipartitioning technique [Fig. 3(c)] divides V into two balanced, temporally independent partitions X and Y, which share the same physical area and communicate through a unidirectional cut. Each partition is one-half the size of the original area. By reducing the complexity of the partitions, their critical paths and delays are reduced (to one-half of the static delay if the partitions are delay-balanced), but there is an additional time cost ($T_c$) due to the reconfiguration, resulting in a total delay greater than that of the static implementation. The cost of this temporally partitioned implementation (or dynamic implementation) can be lower than that of the static one ($C_c < C_a$), resulting in a functional density greater than the original ($D_c > D_a$) and a positive improvement factor ($I > 0$). If the reconfiguration time is further reduced, an improvement factor close to 1 can be achieved.
B. The Virtual Circuit
A dynamically reconfigurable FPGA allows a large circuit (the static implementation) to be implemented on a smaller physical device by sharing the hardware resources through time-multiplexed scheduling and dynamic reconfiguration, improving the functional density. The dynamic implementation is also named a virtual circuit because of its analogy with virtual memory, where a program can be larger than the physical memory and portions of the virtual memory are mapped to the physical memory when needed [8].
The procedure is to perform a temporal partitioning of the circuit to obtain a set of temporally independent partitions, so that each partition is executed in one context of the hardware resources of the DRFPGA, independently of the others, during a micro-cycle [5]. During a micro-cycle, one context is the active context in the programmable logic, while the remaining ones are virtual contexts (their configuration bits are in the on-chip memory, but not yet in use). In the next micro-cycle a virtual context becomes the new active one and the old active context becomes virtual, and so on. The ordered set of micro-cycles is a macro-cycle, and it corresponds to the clock cycle of the static implementation of a synchronous circuit. When the last micro-cycle has finished, a new macro-cycle is executed. Each context provides a set of cut-signals (communication signals to the next contexts) as a function of its primary inputs (the inputs of the virtual circuit) and other cut-signals (outputs from other contexts).
A virtual circuit is equivalent to its static implementation if it produces a set of output vectors (at the end of the macro-cycle) identical to the set obtained with the static implementation (at the end of the clock cycle) for any set of input vectors.
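The equivalence criterion can be illustrated with a small sketch. The circuit below, the function names, and the dictionary used as a buffer are hypothetical and are not taken from the benchmarks of this paper: a three-input combinational function is split into two contexts executed in consecutive micro-cycles, with one cut-signal carried between them, and the outputs are compared exhaustively against the static version.

```python
from itertools import product


def static_circuit(a: int, b: int, c: int) -> int:
    """Static implementation: evaluated in a single clock cycle."""
    return (a & b) ^ c


def virtual_circuit(a: int, b: int, c: int) -> int:
    """Two-context implementation: one macro-cycle made of two micro-cycles."""
    buffer = {}                      # holds cut-signals between micro-cycles

    # Micro-cycle 1: context 1 is active and produces the cut-signal t.
    buffer["t"] = a & b

    # Micro-cycle 2: context 2 is active and consumes the buffered cut-signal.
    return buffer["t"] ^ c


# Exhaustive check over all input vectors of this small example.
equivalent = all(
    static_circuit(*vector) == virtual_circuit(*vector)
    for vector in product((0, 1), repeat=3)
)
print(equivalent)
```

An exhaustive comparison is only practical for very small circuits; it is used here just to make the definition concrete.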
The reconfiguration scheme included in the FIPSOC device permits us to share signal values between different time windows using its flip-flops/latches without taking extra time. The only time penalty is the one needed to change the active context, which minimizes the extra time spent in the cut. The hardware resources of the FIPSOC implement two contexts; studies of the benefits of adding contexts versus the area cost concluded that the number of contexts on a DRFPGA should be small [6], [7], because DRFPGAs with more contexts need to devote more area to their communications.
IV. THE TEMPORAL BIPARTITIONING ALGORITHM
Spatial partitioning algorithms [1] work with a graph representation of the circuit netlist. The representation is a graph $G = \{V, E\}$ or a hypergraph $H = \{V, E\}$, containing a set of vertices V (representing the logic functions) and a set of edges or hyperedges E (representing their relationships), but without information about their temporal dependencies. In order to preserve the temporal independence between partitions, the bipartitioning algorithm must take into account the temporal dependence between vertices by means of directed edges or hyperedges. The algorithm generates unidirectional cuts that determine the order of the micro-cycle sequence inside a macro-cycle.
For the proposed temporal bipartitioning algorithm we use a directed hypergraph $H = (V, E)$, with a set V of combinational and sequential vertices and a set E of directed hyperedges. Any hyperedge $e_v = (v_0; v_1, v_2, \ldots, v_n)$ contains one source vertex, $v_0$, and one or more destination vertices, $v_1, v_2, \ldots, v_n$. By using a directed hypergraph to model the circuit netlist we avoid the twin edges (one edge for each direction) and dummy vertices used in [8] and [9].
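A minimal sketch of such a directed hypergraph is given below. The class and method names are hypothetical and do not correspond to the C++ implementation described later; the sketch only shows one source vertex per hyperedge, one or more destinations, and the combinational/sequential distinction stored per vertex.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    source: str            # v0, the single driving vertex
    destinations: tuple    # v1 ... vn, the driven vertices


@dataclass
class DirectedHypergraph:
    sequential: dict = field(default_factory=dict)  # vertex name -> is sequential
    edges: list = field(default_factory=list)

    def add_vertex(self, name: str, sequential: bool = False) -> None:
        self.sequential[name] = sequential

    def add_hyperedge(self, source: str, *destinations: str) -> Hyperedge:
        if not destinations:
            raise ValueError("a hyperedge needs at least one destination")
        for vertex in (source, *destinations):
            if vertex not in self.sequential:
                raise KeyError(f"unknown vertex {vertex!r}")
        edge = Hyperedge(source, tuple(destinations))
        self.edges.append(edge)
        return edge

    def is_sequential_edge(self, edge: Hyperedge) -> bool:
        """A hyperedge is sequential when its source vertex is sequential."""
        return self.sequential[edge.source]

    def inputs_of(self, vertex: str) -> list:
        """Hyperedges that drive the given vertex."""
        return [e for e in self.edges if vertex in e.destinations]


h = DirectedHypergraph()
h.add_vertex("c0")                    # combinational vertex (e.g., a LUT)
h.add_vertex("s0", sequential=True)   # sequential vertex (e.g., a D-FF)
h.add_vertex("v1")
e_c = h.add_hyperedge("c0", "s0", "v1")   # combinational hyperedge
e_s = h.add_hyperedge("s0", "v1")         # sequential hyperedge
```

Because the direction is stored in the hyperedge itself, a feedback path through a register is represented by two ordinary hyperedges and needs no auxiliary vertices.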
A temporal bipartitioning is correct if it satisfies the principle of temporal independence between contexts, that is, if all the inputs of all the vertices of a context have been calculated previously. To develop the proposed bipartitioning method we consider combinational and sequential hyperedges separately.
A. Combinational Hyperedges
A combinational hyperedge is one whose source is a combinational vertex. Temporal independence for a combinational hyperedge requires that its source vertex be executed before all of its destination vertices, either in the same micro-cycle or in a previous micro-cycle within the same macro-cycle. We define the function $m(v)$, which provides the index of the micro-cycle where the context containing the vertex v is placed, and the function $M(e)$, which provides the index of the macro-cycle where the valid value of the hyperedge is obtained.
The principle of temporal independence of a bipartitioning for a combinational hyperedge can be summarized as follows: the source vertex must be placed in the same micro-cycle as its destination vertices or in a previous one, within the same macro-cycle. A destination vertex $v_1$ placed in the same micro-cycle $m_1$ as the source produces its valid output in the same macro-cycle ($M_T$) if $v_1$ is a combinational vertex, or in the next macro-cycle ($M_{T+1}$) if $v_1$ is a sequential vertex. For the vertices $v_2$ and $v_3$, the value of the hyperedge must be transferred from $m_1$ to $m_2$ within the same macro-cycle ($M_T$) using an additional resource in the unidirectional cut. As with $v_1$, the valid outputs of $v_2$ and $v_3$ are obtained in the same macro-cycle ($M_T$) or in the next macro-cycle ($M_{T+1}$), depending on whether they are combinational or sequential vertices. This combinational hyperedge adds one cut-signal to the unidirectional cut of the bipartition. The bipartition satisfies the temporal independence principle for a combinational hyperedge because, when all the inputs of a context are available from a previous micro-cycle, the context can be executed in its micro-cycle independently of the other context.
When a combinational hyperedge has one or more of its destination vertices in a different micro-cycle than its source vertex, it is a combinational cut-signal of the bipartition. Every combinational cut-signal must transfer its value for a macro-cycle $M_T$ from the source micro-cycle $m_1$ to the next micro-cycle $m_2$ using an additional buffer resource. We use a standard FIPSOC D-Latch in static mode, controlled by the context index signal, I, available in the FIPSOC device, as depicted in Fig. 6. While the virtual circuit is executing the first micro-cycle $m_1$ of a macro-cycle, $I = 0$ and the D-Latch stores the value driven by the source vertex $c_0$. In the next micro-cycle $m_2$ of the same macro-cycle $M_T$, the signal changes to $I = 1$ and the D-Latch holds the previously stored value, which drives the input of the destination vertex $v_j$. In the next macro-cycle $M_{T+1}$, the value stored in the latch changes, since it is valid for just one macro-cycle.
This D-Latch in static mode is occupied for two micro-cycles, so we assign a weight of two to this combinational cut-signal.
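The buffering behavior described above can be sketched as follows. The class is a hypothetical behavioral model, not the FIPSOC primitive: it assumes that the latch is transparent while the context index I is 0 and holds its value while I is 1, as stated in the text, and it models the output of the inactive source context in the second micro-cycle as 0.

```python
class DLatch:
    """Level-sensitive latch: transparent while the context index I is 0."""

    def __init__(self) -> None:
        self.q = 0

    def step(self, d: int, context_index: int) -> int:
        if context_index == 0:   # micro-cycle m1: follow the source vertex
            self.q = d
        return self.q            # micro-cycle m2 (I = 1): hold the stored value


latch = DLatch()
trace = []
for source_value in (1, 0, 1):                        # one value per macro-cycle
    latch.step(source_value, context_index=0)         # m1: capture the cut-signal
    # m2: the source context is not active; its output is modelled as 0 here.
    trace.append(latch.step(0, context_index=1))      # m2: destination reads

print(trace)  # prints [1, 0, 1]
```

The latch is busy in both micro-cycles of every macro-cycle, which is the reason for the weight of two given to a combinational cut-signal.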
In order to rule out a cut-signal that violates temporal independence, we assign an infinite weight to a hyperedge connecting a source in a later micro-cycle to one or more destinations in a previous micro-cycle.
If a combinational hyperedge does not take part in the cut of the bipartitioning, it does not need any additional buffer resource, because all of its vertices are placed in the same context. We assign a weight of zero to such a hyperedge.
To calculate the weight of the hyperedge $e_c = (c_0; v_1, v_2, \ldots, v_n)$ it is necessary to compute the weights of all of its building edges $e_{c,v} = (c_0, v_j)$, $j = 1, 2, \ldots, n$, as depicted in Fig. 7.
The weight of a combinational hyperedge is given by the additional hardware resources (latches) that must be added in both contexts in order to maintain the functionality of the virtual circuit. The weight of a hyperedge may change when the source vertex or a destination vertex is placed in a different context. Weights must therefore be assigned to the hyperedges dynamically, depending on the situations explained above, in contrast with spatial partitioning [1] and the temporal partitioning algorithms in [8], [9], where the weights of the edges are static.
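The three weight cases for a combinational hyperedge can be sketched as below. The per-edge values (0, 2, infinity) follow the text; the way the building-edge weights are combined into a hyperedge weight is an assumption made for this sketch (the maximum is used, on the reasoning that a single latch can serve every destination of one cut-signal), and the function names and the placement map `m` are hypothetical.

```python
import math


def edge_weight(m_source: int, m_dest: int) -> float:
    """Weight of one building edge (c0, vj) from the micro-cycle indices."""
    if m_dest == m_source:
        return 0            # same context: no buffer resource needed
    if m_dest > m_source:
        return 2            # forward cut: a static D-latch busy for two micro-cycles
    return math.inf         # backward cut: violates temporal independence


def hyperedge_weight(m: dict, source: str, destinations: list) -> float:
    # Assumption: one latch serves all destinations of the cut-signal, so the
    # building-edge weights are combined with max() instead of being summed.
    return max(edge_weight(m[source], m[v]) for v in destinations)


# Hypothetical placement: source and v1 in micro-cycle 1, v2 and v3 in micro-cycle 2.
m = {"c0": 1, "v1": 1, "v2": 2, "v3": 2}
print(hyperedge_weight(m, "c0", ["v1", "v2", "v3"]))

# Moving the source to the second context changes the weight dynamically:
m_moved = dict(m, c0=2)
print(hyperedge_weight(m_moved, "c0", ["v1", "v2", "v3"]))
```

The second call shows why the weights cannot be precomputed once: the same hyperedge becomes illegal (infinite weight) as soon as its source is placed after one of its destinations.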
B. Sequential Hyperedges
A sequential hyperedge is one whose source is a sequential vertex. Temporal independence for a sequential hyperedge requires that its source vertex be executed before its destination vertices, in any micro-cycle within the previous macro-cycle. Using the function $M(e)$, the principle of temporal independence of a bipartitioning for a sequential hyperedge is summarized in the following condition:
$$M(e_v) = M(\forall e_v \mid e_v = (v_k; \ldots, v_0, \ldots, v_n)) - 1. \qquad (4)$$
Fig. 8(a) represents a sequential hyperedge $e_s = (s_0; v_1, v_2, v_3)$, whose source is a sequential vertex $s_0$, executed in a macro-cycle $M_T$, with three destination vertices $v_1, v_2, v_3$. Although the sequential vertex is executed in macro-cycle $M_T$, its valid result is available in the next macro-cycle $M_{T+1}$.
Fig. 8(b) depicts a correct temporal bipartitioning. The vertex $v_2$ is executed in the same micro-cycle $m_1$ as the source $s_0$, but its valid result is obtained in the next macro-cycle, $M(e_v) = T+1$, if $v_2$ is a combinational vertex, or in the one after that, $M(e_v) = T+2$, if $v_2$ is a sequential vertex. For the vertices $v_1$ and $v_3$, the results are obtained in the other micro-cycle $m_2$, but one or two macro-cycles later depending on whether they are combinational or sequential vertices, as for $v_2$. This sequential hyperedge adds one cut-signal to the unidirectional cut of the bipartitioning. The bipartitioning satisfies the temporal independence principle for a sequential hyperedge because, when all the inputs of a context are available from a previous macro-cycle, the context can be executed in its micro-cycle independently of the other context.
Fig. 8(c) depicts an incorrect temporal bipartitioning. The inputs of the vertices placed in the macro-cycle $M_T$ have not been calculated in a previous macro-cycle. This solution violates the temporal independence principle because the inputs of the virtual circuit for a macro-cycle are not available.
Because the value of every sequential hyperedge must be transferred to the next macro-cycle, a standard FIPSOC D-FF in dynamic mode must be added for every micro-cycle that the hyperedge must cross to reach a destination vertex. The clock input of the D-FF is the micro-cycle clock, which is the same clock signal that is used by the sequential vertices. Where required, a D-FF is placed in micro-cycle $m_1$ to transport the value through $m_1$ of the macro-cycle $M_{T+1}$. Fig. 9 depicts a sequential edge with its source and destination vertices placed in the three situations explained above.
To calculate the weight of the hyperedge $e_s = (s_0; v_1, v_2, \ldots, v_n)$ it is necessary to compute the weights of all of its building edges $e_{s,v} = (s_0, v_j)$, $j = 1, 2, \ldots, n$, as depicted in Fig. 10. The weight of the hyperedge is given by the additional buffer resources (D-FFs) that must be added in both contexts to maintain the functionality of the virtual circuit.
C. The Temporal Bipartitioning Algorithm
Bearing in mind all the previous considerations, the proposed temporal bipartitioning algorithm for synchronous circuits has been developed as a generalization of the algorithm for combinational circuits proposed in [13]. A diagram of the proposed algorithm is depicted in Fig. 11. An initial solution is generated by placing in the first context all the vertices that satisfy the temporal independence principle there, that is, the vertices whose inputs are exclusively system inputs or sequential hyperedges, and placing the rest of the vertices in the second context. An iterative process is then executed, which consists of selecting and transferring vertices from the larger context to the smaller one until the context sizes are balanced, taking into account the additional area of the buffer resources. The strategy for selecting vertices is based on a full search, over the larger context, for the vertex that minimizes the cost of the bipartitioning (i.e., the number of buffer resources).
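The overall flow (initial solution, then repeated transfer of the cheapest vertex from the larger to the smaller context) can be sketched as below. This is a simplified illustration under stated assumptions, not the algorithm of Fig. 11: it handles combinational vertices only, charges two buffer slots per combinational cut-signal, uses an infinite cost for backward cuts, and ignores the buffer area when checking the balance. The netlist format (`fanin`, mapping each vertex to the names of its drivers) and the function names are hypothetical.

```python
import math


def cut_cost(fanin: dict, ctx: dict) -> float:
    """Buffer cost of a placement; infinite if temporal independence is violated."""
    fanout = {}
    for vertex, drivers in fanin.items():
        for driver in drivers:
            if driver in fanin:                  # names not in fanin are primary inputs
                fanout.setdefault(driver, []).append(vertex)
    cost = 0
    for source, dests in fanout.items():
        if any(ctx[d] < ctx[source] for d in dests):
            return math.inf                      # source placed after a destination
        if any(ctx[d] > ctx[source] for d in dests):
            cost += 2                            # one latch, busy two micro-cycles
    return cost


def bipartition(fanin: dict, tolerance: int = 1) -> dict:
    # Initial solution: vertices fed only by primary inputs go to context 1.
    ctx = {v: 1 if all(d not in fanin for d in drivers) else 2
           for v, drivers in fanin.items()}
    for _ in range(len(ctx)):                    # bounded number of transfers
        size1 = sum(1 for c in ctx.values() if c == 1)
        size2 = len(ctx) - size1
        if abs(size1 - size2) <= tolerance:
            break
        src, dst = (1, 2) if size1 > size2 else (2, 1)
        best, best_cost = None, math.inf
        for v in sorted(ctx):                    # full search over the larger context
            if ctx[v] != src:
                continue
            ctx[v] = dst                         # tentative move
            cost = cut_cost(fanin, ctx)
            ctx[v] = src                         # undo
            if cost < best_cost:
                best, best_cost = v, cost
        if best is None:                         # no legal transfer exists
            break
        ctx[best] = dst
    return ctx


# Hypothetical netlist: a chain of four gates; a, b, c, d are primary inputs.
netlist = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2"], "g4": ["g3", "d"]}
placement = bipartition(netlist)
print(placement, cut_cost(netlist, placement))
```

In this sketch every candidate move recomputes the full cost, which is the expensive step that a cached and incrementally updated cost is meant to avoid.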
To improve the execution speed of the algorithm (determined mainly by the SelectVertex() procedure represented in Fig. 11), we have implemented a vector of weights for every vertex, with an invalidation technique and an incremental calculation of the buffer resources.
Every vertex is associated with a vector of the weights of its associated hyperedges (i.e., its output hyperedge and its input hyperedges), named the vector of hyperweights. Each hyperweight element of the vector is obtained as the sum of the weights of the associated hyperedges for a vertex $v_0$ placed in a context $m_t$, as expressed in (6). In the same way, we calculate the hyperweight when the vertex $s_0$ is placed in the destination context $m_1$. The gain in the cost of the bipartitioning if the vertex $s_0$ is moved from the context $m_2$ to $m_1$ is calculated using (8), and it results in a negative number, $-1$. This means that if the vertex $s_0$ is selected to be transferred from the context $m_2$ to $m_1$, the cost of the bipartitioning is reduced by one buffer resource.
The weight of a hyperedge may change when its source vertex or a destination vertex is transferred to another context. When a vertex is moved from the source context to the destination one, the affected hyperweights are declared invalid, but they are not recalculated immediately. When the SelectVertex() procedure is executed, it uses a previously calculated hyperweight if it has not been invalidated; otherwise the hyperweight is calculated using (6). We have implemented an incremental calculation of the buffer resources in both contexts to take into account the area required for their allocation.
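The invalidation technique can be sketched as a small cache with lazy recomputation. The class name, its interface, and the stand-in weight function are hypothetical; the sketch only shows that a hyperweight marked invalid is recomputed on its next use and not at the moment of the move.

```python
class HyperweightCache:
    """Caches per-vertex hyperweights and recomputes them only on demand."""

    def __init__(self, compute):
        self._compute = compute      # callable: vertex -> hyperweight
        self._values = {}            # currently valid hyperweights
        self.recomputations = 0

    def invalidate(self, vertices) -> None:
        """Mark the hyperweights of the given vertices as invalid."""
        for vertex in vertices:
            self._values.pop(vertex, None)

    def get(self, vertex):
        """Return a valid hyperweight, recomputing it if it was invalidated."""
        if vertex not in self._values:
            self._values[vertex] = self._compute(vertex)
            self.recomputations += 1
        return self._values[vertex]


# Stand-in weight function for the demonstration (not the paper's equation).
calls = []


def fake_hyperweight(vertex: str) -> int:
    calls.append(vertex)
    return len(vertex)


cache = HyperweightCache(fake_hyperweight)
cache.get("v1")
cache.get("v1")               # served from the cache, no recomputation
cache.invalidate(["v1"])      # e.g., a neighbor of v1 changed context
cache.get("v1")               # recomputed lazily here
print(cache.recomputations)
```

In a partitioner, the vertices passed to `invalidate` would be the moved vertex and the vertices sharing a hyperedge with it, since only their hyperweights can change.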
V. EXPERIMENTAL RESULTS
A set of test circuits has been chosen to test the performance of the proposed temporal bipartitioning algorithm. These circuits belong to the MCNC suites ISCAS'85 [16] and ISCAS'89 [17]. They were automatically translated to BLIF netlists, then optimized and technology mapped to LUTs and D-FFs with the SIS environment [14], resulting in new BLIF netlists that are used as the input of the algorithm that provides the bipartition results.
The algorithm is an enhanced version of that presented in [15]. It was developed in C++, and the mapping of LUTs and D-FFs onto a FIPSOC device is implemented as a class derived from a technology base class, so the algorithm can be easily ported to other technologies (based on DRFPGAs with multicontext FFs and latches) by writing a new derived class for the specific device. The experimental results obtained for the benchmark circuits are presented in Table I. The first column is the circuit name. The size of the circuit is presented in terms of the number of LUTs and D-FFs (second column) and the number of inputs, outputs, and internal nets (third column). The number of cut-signals obtained is presented in the fourth column, and the associated number of buffer resources, in terms of D-Latches and D-FFs for contexts 1 and 2, is presented in the fifth column. The parameters of the static implementation of the benchmark circuit are the area (in terms of the number of DMCs), in the sixth column, and the delay (estimated in terms of the maximum logic depth), in the seventh column. The parameters of the dynamic implementation obtained by the algorithm are given in the eighth column for the area and in the ninth column for the delay. The area of a context is computed from the number of DMCs required to map the LUTs and the buffer resources, while the delay is computed in terms of the maximum logic depth. The area of the bipartition is the maximum of the areas of both contexts, while the delay is the sum of the delays of both contexts. Finally, the improvement factor D is given in the tenth column, and the total run time (in seconds), obtained on an AMD K6/2-500MHz PC with 128 MB of main memory running Windows'98, is given in the last column.
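The way the area and delay of a bipartition are derived from its two contexts can be sketched as below. The function name and all the numbers are hypothetical and are not taken from Table I; the context-swap time is not included in this simple delay estimate.

```python
def bipartition_metrics(ctx1: tuple, ctx2: tuple) -> tuple:
    """Each context is (area in DMCs, delay in logic levels).

    The contexts share the same hardware, so the area is the larger of the two;
    they run one after the other, so the delays add up.
    """
    area = max(ctx1[0], ctx2[0])
    delay = ctx1[1] + ctx2[1]
    return area, delay


# Hypothetical static implementation: 40 DMCs, logic depth 10.
static_area, static_delay = 40, 10

# Hypothetical contexts, with buffer resources already included in the areas.
area, delay = bipartition_metrics((22, 6), (21, 5))

static_cost = static_area * static_delay      # C = A * T for the static circuit
dynamic_cost = area * delay                   # C = A * T for the virtual circuit
print(area, delay, static_cost, dynamic_cost)
```

In this made-up case the dynamic implementation is slightly slower but needs far fewer DMCs, so its area-time cost is lower than the static one.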
These results show that some designs are very good candidates for dynamic implementation. Some circuits reach a large improvement factor, near the ideal 100% that corresponds to a perfect temporal bipartitioning. For circuits with a lower improvement factor, dynamic implementation is still an interesting alternative to the static one, since they require fewer hardware resources, although the delay is increased. In this case a more economical implementation is possible for applications where the execution time is not the most critical parameter.
A direct comparison of the proposed algorithm with the eFDS algorithm [6] and the -bounded network flow algorithm [8] is not possible, because the synthesized circuits are different and those algorithms use a parameter TCM that measures the number of buffer resources as the maximum number of CMs (communication modules) in the partitions. For this reason we use the ratio between the number of buffer resources (in both contexts) and the number of signals. To estimate the number of buffer resources in both contexts for a bipartition using [6] and [8] we multiply the TCM parameter by 2, whereas for our algorithm we sum all the buffer resources. The results obtained are presented in Table II. The first column is the circuit name. The numbers of LUTs and FFs of the synthesized circuit used in [6] and [8] are presented in the second column, and the number of nets is in the third one. The fourth and sixth columns contain the TCM parameters for [6] and [8], respectively. The ratio between buffer resources and nets for [6] and [8] is given in the fifth and seventh columns. The size of the circuit synthesized for the FIPSOC technology, used in our approach, is presented in LUTs and FFs in the eighth column, and the number of nets is in the ninth column. The number of buffer resources obtained is in the tenth column, and the ratio between buffers and nets is in the last column. Using this approach we show that the proposed algorithm needs fewer buffers in all cases except one (measured in relative terms with respect to the number of signals) than the algorithms presented in [6] and [8].
VI. CONCLUSION
The presented temporal bipartitioning technique uses standard hardware resources (multicontext registers) to implement the buffering between contexts, avoiding the specialized hardware resources used in other techniques [5]. Although this temporal bipartitioning technique has been tested with the FIPSOC architecture, it can be adapted to any DRFPGA with multicontext registers (independent output states for each context). The algorithm developed from this technique can be easily adapted to another technology by deriving the specific technology classes for the mapping of the hardware resources. The experimental results obtained by applying this algorithm to a set of benchmark circuits show that the dynamic implementation of a circuit reduces the required area, compared with its static implementation, although the delay is increased. These facts indicate that the dynamic implementation of a design should be taken into account in applications where delay is not the critical parameter. In addition, the proposed algorithm obtains a better average ratio between buffer resources and nets than the other temporal partitioning algorithms [6], [8].
