In this paper, a high-performance reconfigurable coarse-grain data-path, part of a hybrid reconfigurable platform, is introduced. The data-path consists of coarse-grain components whose flexibility and universality are shown to increase the system's performance through significant reductions in latency. A methodology of simple but efficient algorithms for mapping computationally intensive applications onto the proposed data-path is also presented. Results on Digital Signal Processing and multimedia benchmarks show an average reduction in execution cycles of 20%, combined with a decrease in area consumption, when the proposed data-path is compared with a high-performance one. The average cycle reduction is even greater, 44%, when the comparison is made with a data-path that instantiates primitive computational resources on FPGA hardware.
Introduction
Modern applications, such as multimedia and wireless telecom ones, are characterized by high complexity and diverse functionality, while they require high performance and low power consumption. Reconfigurable computing can combine the flexibility offered by software with the high performance of hardware (Refs. 1-3). In hybrid (mixed) granularity reconfigurable platforms, performance is increased and power consumption is decreased due to the presence of coarse-grain units (Refs. 1-3), when these are used instead of fine-grain reconfigurable (FPGA) units. Research activities in High-Level Synthesis (HLS) (Ref. 4) and in Application Specific Instruction Processors (ASIPs) (Refs. 5, 6, 7) have shown that the use of complex data-path resources, called templates or clusters, instead of primitive ones (such as single ALUs and multipliers) improves the performance of computationally intensive applications, like Digital Signal Processing (DSP) and multimedia ones. A template may be a specialized hardware unit or a group of optimally designed chained units. Chaining is the removal of the intermediate registers between the primitive units, which improves the total delay of the combined units.
In this paper, a high-performance reconfigurable coarse-grain data-path for implementing computationally intensive applications (or parts of an application) is introduced. It consists of a set of Coarse-Grain Components (CGCs) implemented in ASIC technology, a reconfigurable interconnection network, and a register bank. The data-path is part of a hybrid reconfigurable System-on-Chip (SoC) platform, illustrated in Figure 1. The platform also comprises a microprocessor, an embedded FPGA (fine-grain reconfigurable logic), and data and program memories. The microprocessor executes the parts of the application, derived after a hardware/software partitioning stage, which are usually not computationally intensive. The FPGA logic typically executes small bit-width operations, since these match its granularity. It also hosts the control-unit of the CGC data-path. Each CGC is an n x m array of nodes, where n and m are the number of rows and columns, respectively, and it is able to realize complex operations, like multiply-add ones. Direct inter-CGC connections exist to fully exploit chaining among nodes of different CGCs, in contrast to existing template-based methods, as in Refs. 4-7. Also, the introduced CGC data-path can be efficiently used when a temporal partitioning approach, as in Ref. 8, is adopted. Furthermore, the application mapping is accommodated by simple, yet effective, algorithms.
The rest of the paper is organized as follows. Section 2 presents the related work on hybrid reconfigurable architectures and on high-performance template-based data-paths. The motivation for developing the CGC data-path is explained in section 3. The CGC data-path is described in section 4, while section 5 presents the features arising from the design of the data-path. The mapping methodology for the proposed data-path architecture is described in section 6. Section 7 presents the experimental results. Finally, concluding remarks and future work are given in section 8.
Related Work
Ref. 3 presents the Pleiades SoC architecture, a hybrid reconfigurable platform that combines an on-chip microprocessor with a number of heterogeneous reconfigurable units of different granularities connected via a reconfigurable interconnection network. These units are mainly MACs, ALUs, and an embedded FPGA. Although promising results, especially in power consumption, have been reported, Pleiades suffers from a major drawback: no systematic automated methodology exists for mapping an application or a family of applications on its architecture.
The Strategically Programmable System (SPS) is another hybrid reconfigurable SoC platform, described in Ref. 5. The coarse-grain units are defined by means of template generation from a set of Data Flow Graphs (DFGs) of the application domain. The clustering of operations is based on the frequency of appearance of successive operations. The authors observe that the number of operations per cluster is small and conclude that simple pairs of operations are the best candidates to be used as templates. However, the reported templates are not flexible, resulting in a cover of the DFG that is neither full nor optimal. Thus, as the uncovered operations are implemented by FPGA units, the system's performance is reduced and the power consumption is increased.
The hybrid reconfigurable approach has recently been adopted in commercial FPGA devices, like the Xilinx Virtex-II/4 (Ref. 9) and Altera Stratix (Ref. 10). These devices contain ASIC multiplier units, which operate on 18-bit operands and can be considered as the coarse-grain hardware. However, additional operations, such as ALU operations, appear in DSP and multimedia kernels. Since these operations are realized by the fine-grain reconfigurable logic, the performance is reduced compared to a full covering of the DFG operations by the CGCs, which are implemented in ASIC technology.
Ref. 4 , assuming a given library of optimally designed templates, proposed a method for selecting the proper templates to realize the application's DFG and thus improving the performance. The reported results demonstrate high performance gains with an affordable area increase. Although many optimization techniques were utilized as part of the synthesis strategy, template selection had the largest impact on overall improvement in performance. Also, this method does not fully exploit chaining; thus it causes an increase in latency as it will be shown in section 3. Latency is the number of clock cycles to execute the entire schedule of an application (Ref. 11).
Ref. 6 presents algorithms for identifying clusters of dataflow operations to be implemented as application-specific instructions for existing configurable (extensible) processors. The authors define the candidate extended instruction to be a convex directed acyclic subgraph, which is defined as a cut, with certain input and output constraints. They use a branch and bound method to identify a single cut in a basic block with maximum speedup. However, the complexity of the branch and bound algorithm grows very fast, when the number of instructions becomes large. Also, the objective to maximize the sum of speedup of each individual cut may not always result in the minimum execution time, thus not achieving high-performance.
In Ref. 7, the problem of generating application-specific instructions for configurable processors with the aim of improving delay is addressed. The instruction (template) generation considers only the Multiple-Input Single-Output (MISO) format, which cannot take advantage of register files with more than one write port. The pattern library is selected for maximizing the potential speedup, subject to a total area constraint. Nevertheless, this does not exclude the generation of a large number of different patterns, which complicates the application mapping step. The mapping stage is an NP-hard problem with high complexity.
Motivation for the Data-Path Development
In existing template-based methods (Refs. 4-7), a large number of templates is usually required for implementing the application's DFG. When the number of templates becomes prohibitive, it has been proposed that part of the DFG be covered by ASIC or FPGA primitive resources (Ref. 5), but this has an impact on both performance and power dissipation. This large number of templates complicates the mapping of the application to the data-path and, more importantly, prevents inter-template connections through an interconnect network. The complexity of an interconnect network, which increases the delay in the data-path, grows as the number of computational resources in the data-path increases. So, the performance is decreased, since inter-template chaining of operations cannot be exploited when the communication among the templates takes place through the register bank. The proposed CGC data-path tackles the above issues. Consider the example in Figure 2. Let us assume that both the templates (Figure 2a) and the DFG nodes (Figure 2b) perform two-operand computations. The primary inputs of the DFG and templates are omitted for clarity. It is assumed that inter-template connections are enabled. From Figure 2a, it is deduced that the computational structures appearing in the three control steps (c-steps) are similar. In particular, the sub-graphs at the first two c-steps (C1 and C2) have the same structure and differ only in the operations performed. In the third c-step, four pairs of operations (sub-graphs) exist that differ in the operations performed. In order to achieve minimum latency, the DFG is implemented using 8 template instances, as shown in Figure 2b. This number of templates is required since the type of templates in Figure 2a cannot exploit the similarity of operations between the c-steps of the DFG.
However, as it will be shown in the following, the use of the universal and reconfigurable CGC allows implementation with fewer component instances.
An example of optimal chaining exploitation through direct inter-template connections is shown in Figure 3. If direct inter-template connections are permitted (as in the case of the CGC data-path), the chaining of operations is optimally exploited and the DFG is executed in one cycle (Figure 3a). On the other hand, if direct inter-template connections are not allowed, as in Refs. 4-7, the result of operation b is produced and stored in the register bank in the first clock cycle and is consumed in the second cycle. Thus, the DFG is now realized in two cycles (Figure 3b), resulting in performance degradation. The aforementioned problems of existing template-based data-paths (large number of computational resources and latency increase) are addressed by designing a flexible reconfigurable component able to implement any desired type of template. This component is the CGC, which is an n x m array of nodes (n stands for the rows and m for the columns), where each node can implement an ALU operation or a multiplication in a c-step. So, the functional units (FUs) inside a CGC node are a multiplier and an ALU. To select these types of FUs, we have assumed that the operations in the DFG are either ALU operations or multiplications, which is the usual case in DSP applications. In cases where special operations (e.g. square root) are required, these can be implemented: (a) by specific hardware units, which will operate as co-processor units to the proposed data-path, or (b) by emulating these operations with a series of ALU operations and multiplications, where this is applicable.
An abstract schematic of the 2x2 CGC is shown in Figure 4a, where each node performs a two-operand (ALU/multiplication) operation. The steering logic (multiplexers and tri-state buffers), which controls the data transfers from one CGC node to another, is not shown for clarity. Due to the universality and flexibility of the CGC, the DFG of Figure 2 is implemented (covered) by two CGCs (Figure 4b) instead of the eight template instances needed when inflexible templates from a library are considered (Figure 2b). Thus, the DFG is implemented with reduced area, while the existence of a small number of reconfigurable and identical components (i.e. the CGCs) permits direct inter-component connections and thus the exploitation of inter-component chaining. On the other hand, the large number of templates required for the DFG covering (Figure 2b) prohibits the design of an interconnect network, since such an interconnect would cause an increase in delay (i.e. in the clock cycle period) and a corresponding performance degradation in the data-path. Without inter-template connections, the DFG of Figure 2 is executed in 4 clock cycles, instead of the 3 cycles needed if there were direct connectivity among the templates. In the following sections of this paper, the proposed CGC data-path, its characteristics, and the methodology for mapping applications to this data-path are discussed in detail.
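The one-cycle versus two-cycle behavior described for Figure 3 can be illustrated with a small sketch. The DFG, the template assignment, and the function names below are hypothetical (the actual graph of Figure 3 is not reproduced here), and resource limits are ignored: the sketch only contrasts chaining across templates with communication through the register bank.

```python
from math import ceil

def asap_levels(preds):
    """Earliest level (1-based) of every node in an acyclic graph."""
    levels = {}
    def level(n):
        if n not in levels:
            levels[n] = 1 + max((level(p) for p in preds[n]), default=0)
        return levels[n]
    for n in preds:
        level(n)
    return levels

def latency_with_chaining(preds, chain_depth):
    """Cycles needed when up to chain_depth dependent operations can be
    chained in one clock cycle, even across templates (assumption)."""
    return ceil(max(asap_levels(preds).values()) / chain_depth)

def latency_without_chaining(preds, template_of):
    """Cycles needed when a template can only read results of other
    templates from the register bank, i.e. one cycle later."""
    tpreds = {t: set() for t in set(template_of.values())}
    for n, ps in preds.items():
        for p in ps:
            if template_of[p] != template_of[n]:
                tpreds[template_of[n]].add(template_of[p])
    return max(asap_levels(tpreds).values())

# Hypothetical DFG: operation b (template T2) feeds operation c (template T1).
preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["b"]}
template_of = {"a": "T1", "c": "T1", "b": "T2", "d": "T2"}

print(latency_with_chaining(preds, chain_depth=2))   # 1
print(latency_without_chaining(preds, template_of))  # 2
```

In this toy graph the result of b must cross a template boundary, so the register-bank variant needs a second cycle, which mirrors the situation described for Figure 3b.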
CGC Data-Path Architecture
The proposed data-path consists of: (a) the CGCs, (b) a register bank, and (c) the reconfigurable interconnect network, which enables the inter-CGC connections and the connections between the CGCs and the register bank. The structure of the CGC is an n-by-m (n x m) array of nodes, where n and m are the number of rows and columns, respectively. Figure 5a shows a 2x2 CGC, which will be used to demonstrate the features of the introduced component. The 2x2 CGC consists of four nodes, four inputs (in1, in2, in3, in4) connected to the register bank, four additional inputs (A, B, C, D) connected to the register bank or to another CGC, two outputs (out1, out2) also connected to the register bank and/or to another CGC, and two outputs (out3, out4) whose values are stored in the register bank.
Each CGC node consists of two FUs, a multiplier and an ALU, as shown in Figure 5b. Both of these units are implemented in combinational logic to exploit the benefits of chaining operations inside the CGC. The flexible reconfigurable interconnection among the nodes in a CGC is chosen so that any desired complex operation can easily be realized by properly configuring the existing steering logic (i.e. the multiplexers and the tri-state buffers). The ALU performs shifting, arithmetic (add/subtract), and logical operations. At any time, either the multiplier or the ALU is activated according to the control signals Sel1 and Sel2, as shown in Figure 5b. The data-bus width of the CGC is 16 bits, because such a bit-width is adequate for the majority of DSP applications.
An n x m CGC has an analogous structure. In particular, the first-row nodes obtain their inputs from the register bank. All the other CGC nodes obtain their inputs from the register bank and/or from a row with a smaller index in the same and/or another CGC. As for the outputs, the last-row nodes store the results of their operations in the register bank. All the other CGC nodes feed their results to the register bank and/or to another CGC in the data-path. In previously published template-based methods (Refs. 4-7), templates with depth = 2 and 1 ≤ width ≤ 2 (i.e. n = 2, 1 ≤ m ≤ 2 for the CGC) were mainly used for two reasons: (a) larger templates introduce larger area and control overhead relative to a primitive resource data-path (Ref. 4), and (b) templates consisting of two operations in sequence contribute the most to the performance improvements (Refs. 5, 7). So, CGCs with 2 ≤ n ≤ 3 and 2 ≤ m ≤ 3 are adequate for improving performance.
The reconfigurable interconnect network is divided into two sub-networks. The first one is used for the communication among the CGCs. The second one is used for the communication of the CGCs with the register bank for storing and fetching data values. A crossbar interconnection network can be used to provide full connectivity in both cases. When a large number of CGCs is required in the data-path (e.g. for supporting large amounts of parallelism), the crossbar network cannot be used since it is not a scalable network. Instead, an interconnection scheme similar to a fat-tree (Ref. 12), which is a hierarchical network extensively used in multiprocessor systems, can be adopted. In particular, an efficient interconnection network can be created by connecting CGCs into clusters. First, k CGCs are clustered and connected with a switch, which provides full connectivity among the k CGCs. Then, k clusters are recursively connected together to form a supercluster, and so on. A small value of k can result in large performance gains, as shown in Ref. 13. Any two CGCs can be connected with fewer than 2 log_k N - 1 switch devices, where N is the total number of CGCs in the data-path. So, the delay of the interconnections for implementing full connectivity in a fat-tree network increases logarithmically with the number of CGCs (2 log_k N - 1 switch devices are required) and not quadratically as in a crossbar network, where O(N^2) switches are required. Although a fat-tree like interconnection scheme can also be employed by the existing template-based methods (Refs. 4-7), the interconnect delay of the CGC-based data-path is smaller, as it consists of a smaller number of regular and uniform hardware resources (the CGCs) than the computational resources of a data-path derived by existing template-based methods.
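A behavioral sketch may help to visualize a CGC node and the chaining of two nodes into a multiply-add. The function names, the string opcodes (standing in for the Sel1/Sel2 control signals and the ALU function select), and the truncation of every result to 16 bits are assumptions made for illustration; the actual encoding of Figure 5b is not given in the text.

```python
MASK = 0xFFFF  # 16-bit data bus, as stated for the CGC

def cgc_node(op, x, y):
    """One CGC node: either the multiplier or the ALU is active.
    Results are truncated to 16 bits (an assumption of this sketch)."""
    if op == "mul":            # multiplier selected
        result = x * y
    elif op == "add":          # ALU: arithmetic
        result = x + y
    elif op == "sub":
        result = x - y
    elif op == "and":          # ALU: logical
        result = x & y
    elif op == "or":
        result = x | y
    elif op == "shl":          # ALU: shifting
        result = x << (y & 0xF)
    elif op == "shr":
        result = x >> (y & 0xF)
    else:
        raise ValueError(f"unsupported operation: {op}")
    return result & MASK

def multiply_add(a, b, c):
    """Chained multiply-add: a first-row node multiplies and a second-row
    node adds, with no register between them (combinational chaining)."""
    return cgc_node("add", cgc_node("mul", a, b), c)

print(multiply_add(3, 4, 5))  # 17
```

Because both functional units are combinational, the nested call models the absence of an intermediate register: the product is consumed by the adder within the same evaluation.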
This is due to the fact that fewer switches are required for implementing the interconnection for the CGC data-path, since the number of CGCs is smaller than the template instances.
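The scaling argument above can be tabulated with a short sketch. It evaluates the 2 log_k N - 1 bound quoted in the text for the fat-tree-like scheme, rounding log_k N up to whole hierarchy levels, and the N^2 switch count of a crossbar. The function names and the rounding are assumptions of this sketch, not part of the original design.

```python
def fat_tree_path_bound(n_cgcs, k):
    """Bound 2*log_k(N) - 1 on the switches between two CGCs, with
    log_k(N) rounded up to the number of hierarchy levels."""
    if n_cgcs < 2 or k < 2:
        raise ValueError("need at least 2 CGCs and a cluster size k >= 2")
    levels, capacity = 0, 1
    while capacity < n_cgcs:   # integer computation of ceil(log_k N)
        capacity *= k
        levels += 1
    return 2 * levels - 1

def crossbar_switches(n_cgcs):
    """Switch count of a full crossbar, which grows as N^2."""
    return n_cgcs * n_cgcs

for n in (4, 16, 64):
    print(n, fat_tree_path_bound(n, k=4), crossbar_switches(n))
```

For k = 4 the path bound grows from 1 to 3 to 5 as N goes from 4 to 16 to 64, while the crossbar count grows from 16 to 256 to 4096, which is the logarithmic versus quadratic contrast the text describes.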
Features of the CGC Data-Path
Since the CGC data-path targets DSP kernels, it can be considered a resource-dominated circuit (Ref. 11). In this case, the data-path's worst-case delay (critical path) is determined by the computational resources (the CGCs in our case) and not by the steering logic (multiplexers) or by the control-unit. Compared with an equivalent CGC functionality realized by templates, as in Refs. 4-7, the CGC's critical path increases due to the tri-state buffers and multiplexers. An experiment has been performed to give an indication of this increase. The delay of a 2x2 (2x3) CGC, compared with a template consisting of two multipliers in sequence, is increased by 4.2% (4.7%) when both are synthesized with a 0.13μm ASIC CMOS library. Thus, the performance improvements over the template-based data-paths are not negated, as the measured percentage increase in delay (i.e. in clock period) is significantly smaller than the percentage reduction in clock cycles over the considered template methodology (see Tables 2 and 4).
Although extra control signals are required to configure a CGC compared to a primitive or a template resource, the control-unit can be designed in such a way that does not incur an increase to the delay of the whole data-path. This can be achieved when control signals are grouped together to define a subset of the state of the control-unit in a c-step. This way of synthesizing the control-unit is supported by current CAD synthesis tools (Ref. 14) , where the control-unit can be automatically synthesized under a given delay constraint. For example, the extra control signals that are required to set-up a CGC (for the multiplexers and the tri-state buffers) can be grouped to form a subset. In this way, the delay of this control-unit is not increased compared to a data-path composed by templates and/or primitive resources. Nevertheless, the area of the control-unit increases. However, since our priority is high-performance and not area consumption, this area increase is not a major consideration in this work.
The introduced area redundancy, mainly due to the two FUs inside the CGC node, is a trade-off for achieving high performance. An area overhead relative to a data-path composed of primitive resources is also incurred by data-paths composed of template hardware units. In Ref. 4, an average increase of 66% in area consumption was reported. Regarding power consumption, each time an operation is covered by a CGC node, either the multiplier or the ALU is activated by properly controlling the corresponding buffers. When a CGC node is not utilized at a c-step, neither the multiplier nor the ALU is activated, thus reducing power consumption.
Temporal partitioning is a procedure for properly dividing an application into a number of temporal segments, which are executed one after another. As illustrated in Ref. 8 , for a DCT transform operating on 4x4 blocks of an input image, the temporal partitioning results in an average improvement of 35% compared to the static reconfiguration of the FPGA device. The CGC-based data-path provides the hardware support for the temporal partitioning of an application, as the FPGA devices do. This is due to the fact that a CGC node can implement all the DFG operations; so the selection of the proper CGC node operation is sufficient to implement every kind of operation in a temporal segment. This is not the case for a data-path composed by primitive resources and/or templates.
The ASAP leveling methodology of Ref. 15 can easily be adapted to the CGC data-path. So, it can serve as an automated methodology for the temporal partitioning of applications to the CGC data-path. However, the ASAP leveling methodology does not consider the resource sharing of functional units (like the CGCs). It has been shown in Refs. 8 and 16 that temporal partitioning combined with resource sharing of computational resources reduces latency, when compared with approaches like the one in Ref. 15. The development of an efficient temporal partitioning methodology, which takes into consideration the resource sharing of the CGCs, is a topic of future work.
Mapping Methodology
Due to the universal and reconfigurable structure of the CGC, a full DFG covering using only CGCs is easily obtained, while the application mapping is simplified. Also, full exploitation of the chaining of operations both inside and among CGCs is achieved, resulting in a performance improvement over primitive resource and template-based data-paths, as shown in Figure 3. The existence of a library consisting of only one type of resource (i.e. the CGC) further simplifies binding with the CGCs. On the other hand, to cover a DFG with a template-based data-path, the data flow structure and the type of the operations of the DFG portion have to be matched with a template available in the library, which results in a difficult matching problem. In previous template-based data-paths (Refs. 4-7), due to the inflexible structure of their templates, a large number of templates is usually instantiated. This prevents the design of an efficient inter-template network. Also, when partial matching (Ref. 4) is not supported by the available templates, as in Ref. 5, the uncovered DFG operations have to be realized by primitive resources. This may result in an increase in delay, area, and power. For example, if the primitive operations are implemented in FPGA hardware, as in Ref. 5, the performance is degraded. This is confirmed by the experimental results of this paper.
The mapping methodology consists of: (a) scheduling of DFG operations, and (b) binding with the CGCs. The input is an unscheduled DFG. For mapping Control Data Flow Graphs (CDFGs) the methodology is iterated through the DFGs comprising the CDFG of an application.
    while (the DFG is not fully bound)
        for the number of CGCs
            for (CGC_index = 0; CGC_index < n; CGC_index++)
                while (col_idx < number of ops in a row && col_idx < number of uncovered DFG nodes)
                    map_to_CGC(DFG_node, CGC_index, col_idx)
                end while;
            end for;
        end for;
    end while;

Figure 6. Pseudo-code of the CGC binding algorithm.

Since the data-path is realized by a fixed number of CGCs, the scheduling is a resource-constrained problem with the goal of latency minimization. The design choice was to use a list-based scheduler. In the CGC-based data-path, the list scheduler is simplified, since it handles one resource type, which is the CGC node. This occurs because the input DFG consists of ALU and/or multiplication operations (or it is transformed to be composed only of these operations) and each CGC node contains an ALU and a multiplier. Thus, each DFG node is considered as one resource type.
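A resource-constrained list scheduler with a single resource type can be sketched as follows. The priority function (longest path to a sink), the tie-breaking by node name, and all identifiers are assumptions of this sketch; the text only states that a list-based scheduler handling one resource type is used.

```python
def list_schedule(preds, ops_per_step):
    """Assign each DFG node to a c-step, with at most ops_per_step
    nodes per step. preds maps each node to its predecessor list and
    must describe an acyclic graph; ops_per_step must be >= 1."""
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)

    prio = {}
    def path_len(n):
        # Priority: length of the longest path from n to a sink.
        if n not in prio:
            prio[n] = 1 + max((path_len(s) for s in succs[n]), default=0)
        return prio[n]

    step_of, step = {}, 0
    while len(step_of) < len(preds):
        # Ready nodes: unscheduled, with all predecessors in earlier steps.
        ready = [n for n in preds if n not in step_of
                 and all(p in step_of for p in preds[n])]
        ready.sort(key=lambda n: (-path_len(n), n))
        for n in ready[:ops_per_step]:
            step_of[n] = step
        step += 1
    return step_of

# Hypothetical DFG with five operations and room for two per c-step.
dfg = {"a": [], "b": [], "c": [], "d": ["a", "b"], "e": ["c", "d"]}
print(list_schedule(dfg, ops_per_step=2))
# {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2}
```

Since every node is an ALU operation or a multiplication and every CGC node offers both, the scheduler needs only one capacity figure per c-step, which is what keeps it simple.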
Due to the features of the CGC data-path, a simple but efficient algorithm is used to perform binding. The pseudo-code of the binding algorithm is shown in Figure 6. The input is the scheduled DFG, where each c-step has clock period T_prim and is called c-step_prim. After binding, the overall latency of the DFG is measured in new clock cycles having period T_CGC. This period is set so that the CGCs have unit execution delay. The CGC binding algorithm maps the DFG nodes to the CGC nodes row-wise.
A term called CGC_index is defined, which is related to the new clock period T_CGC after the binding is performed. This term represents the current level of CGC operations to which the DFG nodes are bound. The CGC_index takes values from 0 to n-1, since a CGC consists of n rows of nodes. The algorithm covers the operations in c-step_prim(s) for CGC_index equal to 0, or until there are no DFG nodes left uncovered in a c-step_prim. Then it proceeds to the next value of CGC_index, up to n-1, if there are any uncovered DFG nodes left. This procedure is repeated for every CGC in the data-path. A CGC is not utilized when there are no uncovered DFG nodes left. Also, a CGC is partially utilized when there is an insufficient number of DFG nodes left and the mapping procedure (map_to_CGC) has already started for this CGC. If there are p n x m CGCs in the data-path, the maximum number of operations per c-step_prim is equal to p ⋅ m. Finally, after CGC binding, the register bank's size is determined.
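The row-wise binding described above can be sketched in a few lines. This is one possible reading of the description, not a transcription of Figure 6: it assumes that consecutive c-step_prim levels are mapped to consecutive CGC rows (row = level mod n), that the operations of a level are spread over the p CGCs with m per row, and that direct inter-CGC connections make any such placement legal. The function names and the tuple layout are hypothetical.

```python
def bind_to_cgcs(c_steps, p, n, m):
    """Bind a scheduled DFG to p CGCs of n rows and m columns.
    c_steps is a list of c-step_prim levels, each a list of node names.
    Returns {node: (cgc_cycle, cgc_id, row, col)}."""
    binding = {}
    for j, ops in enumerate(c_steps):
        if len(ops) > p * m:
            raise ValueError("a c-step_prim may hold at most p*m operations")
        cycle, row = divmod(j, n)        # n levels are packed per T_CGC cycle
        for i, node in enumerate(ops):
            cgc_id, col = divmod(i, m)   # fill one CGC row, then the next CGC
            binding[node] = (cycle, cgc_id, row, col)
    return binding

def latency_in_cgc_cycles(binding):
    """Number of T_CGC clock cycles used by the bound DFG."""
    return 1 + max(cycle for cycle, _, _, _ in binding.values())

# Hypothetical schedule with three c-step_prim levels on two 2x2 CGCs.
schedule = [["a", "b", "c"], ["d", "e"], ["f"]]
bound = bind_to_cgcs(schedule, p=2, n=2, m=2)
print(bound["c"], bound["d"], latency_in_cgc_cycles(bound))
# (0, 1, 0, 0) (0, 0, 1, 0) 2
```

Under these assumptions three c-step_prim levels collapse into two T_CGC cycles, and the second CGC is only partially utilized in the first cycle, matching the partial-utilization case mentioned in the text.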
Experimental Results
A prototype tool has been developed in C++ for demonstrating the efficiency in terms of performance of the introduced CGC data-path. The DFGs used in the experiments were obtained from representative benchmarks described in VHDL and C language. Table 1 shows the number of DFG nodes (operations) of the considered benchmarks.
We have synthesized the control-units of small DFGs (fir11, volterra and ellip) for a data-path composed of two 2x2 CGCs and we have measured the delay of these units. The specification of the control-units has been performed manually, since at present we do not support a method for automatically defining control-units. For the synthesis of the derived control-units, a 0.13μm CMOS ASIC library has been used. The delay of the control-units is a small fraction (approximately 10% on average) of the delay imposed by a 2x2 CGC. This indicates that the delay of the control-unit does not affect the critical path of the proposed data-path, which is defined by the CGC's delay, as mentioned in section 5. Analogous results for the control-unit's delay are expected for DFGs consisting of a large number of nodes (e.g. the gsm_enc).
For the selected set of benchmarks, another experiment is performed which shows that a data-path with two 2x2 CGCs achieves an average clock cycles decrease of 57.3% when compared with a data-path composed by primitive resources. The detailed results are shown in Table 1 .
The clock cycle of the primitive resource data-path is set to the delay of an ALU with functionality similar to the one in a CGC node. From the reduction of clock cycles in Table 1, it follows that the performance of the CGC data-path is higher if:

T_CGC < 2.3 ⋅ T_prim    (1)

where T_CGC and T_prim are the clock periods for the CGC-based and the primitive resource-based data-path, respectively. Eq. (1) has been verified by comparing the delay of a 2x2 CGC with the delay of a 16-bit ALU, where both resources are implemented in structural VHDL and synthesized with a 0.13μm CMOS ASIC library using the LeonardoSpectrum tool (Ref. 20).
The results showed that T_CGC = 2.14 ⋅ T_prim. If the physical layout of the CGC is manually optimized, then T_CGC should become smaller. The same experiment was performed for other types of CGCs (e.g. the 2x3) and showed that the corresponding inequalities like (1) do not impose hard design constraints. Similar constraints, like Eq. (1), have also been satisfied for other template-based methods, as in Refs. 4 and 7, where properly designed templates outperformed the combination of primitive resources. In the last column of Table 1, the CGC usage in the reconfigurable coarse-grain data-path is given. The CGC usage is the ratio of the number of CGC instances enabled in binding to the product of the number of CGCs in the data-path and the latency. A CGC usage equal to 1 implies that all the CGCs in the data-path are utilized in all c-steps. From Table 1 it is concluded that, for the majority of the benchmarks, the two 2x2 CGCs of the data-path are fully (or nearly fully) utilized. So, if this data-path targets a family of applications consisting of the DSP and multimedia kernels presented in this experimental set-up, the CGCs are fully utilized most of the time in each c-step. Also, all the CGCs are utilized at least once in the majority of the c-steps of a benchmark. Thus, this leads to an efficient design of the coarse-grain part of the reconfigurable SoC platform template presented in Figure 1.
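The relation between the 57.3% average cycle reduction, the bound of Eq. (1), and the measured period ratio of 2.14 can be checked numerically, together with the definition of CGC usage. The function names are hypothetical, and the usage example uses made-up counts rather than values from Table 1.

```python
def max_period_ratio(cycle_reduction):
    """Largest T_CGC / T_prim for which the CGC data-path is still
    faster, given the fractional reduction in clock cycles."""
    return 1.0 / (1.0 - cycle_reduction)

def speedup(cycle_reduction, period_ratio):
    """Execution-time speedup: time = cycles * clock period."""
    return 1.0 / ((1.0 - cycle_reduction) * period_ratio)

def cgc_usage(enabled_instances, num_cgcs, latency):
    """CGC instances enabled in binding over (number of CGCs * latency)."""
    return enabled_instances / (num_cgcs * latency)

print(round(max_period_ratio(0.573), 2))   # 2.34
print(round(speedup(0.573, 2.14), 3))      # 1.094
print(cgc_usage(7, num_cgcs=2, latency=4)) # 0.875 (hypothetical counts)
```

A 57.3% cycle reduction leaves 42.7% of the cycles, so the break-even period ratio is 1/0.427, about 2.34, which agrees with the 2.3 of Eq. (1); with the measured ratio of 2.14 the execution time drops to about 0.914 of the primitive data-path's time.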
A third experiment is performed to compare the performance and the area utilization of the introduced data-path with another high-performance data-path, which contains templates. The template library consists of the following templates: multiply-multiply, multiply-alu, alu-alu, and alu-multiply. These templates are chosen because they are proposed by the majority of the existing methods (Refs. 5, 7) to be used for deriving high-performance data-paths for DSP applications. The CGC data-path consists of two 2x2 (case 1) and two 2x3 (case 2) CGCs. So, in case 1, four operations can be executed concurrently in each c-step, while in case 2, six operations. The main assumptions of this experiment are: (a) template partial matching is enabled, and (b) the clock periods for the template and the CGC-based data-path are set for having unit execution delay in both data-paths. The template partial matching enables the full DFG covering, without needing extra primitive resources (res.) to be present in the data-path; this is also the case for the CGC-based data-path. As shown in Section 5, the delay of a 2x2 and a 2x3 CGC compared with a template of two multipliers in a sequence, is marginally larger. Hence, the clock periods in both data-paths are virtually equal; thus the performance comparison in Table 2 and Table 4 is straightforward.
To derive the latency values for the template-based data-path, covering (binding) with the templates is performed on the unscheduled DFG and then scheduling is performed by a list scheduler. Template covering is performed so that the number of primitive computational resources (multipliers and/or ALUs) available in each c-step is equal to the number available in the CGC-based data-path.
As illustrated in Table 2, the CGC data-path achieves better performance (lower latency values) than the data-path consisting of templates, since fewer clock cycles are required to implement the considered benchmarks.

Table 2. Latency results when the CGC data-path is compared with a template-based one (template partial matching enabled).
uncovered nodes of the DFG are assumed to be implemented by primitive resources realized in FPGA technology, as in Ref. 5. The template library is the same as the one in the previous experiment. The clock period of the ASIC components T_ASIC (i.e. the templates) is set to the delay of the multiply-multiply template. In the case of the FPGA hardware, the clock period T_FPGA is set to the delay of the multiplier unit. In this experiment we assume that T_FPGA = 2 ⋅ T_ASIC, which is a rather modest assumption for the performance gain of an ASIC technology compared to an FPGA one. To simplify the synchronization problems between the FPGA and the ASIC hardware, we assume: (a) a closely coupled template data-path and FPGA hardware, and (b) a clock period set to T_ASIC. So, the DFG operations mapped to the FPGA hardware have an execution delay of 2 clock cycles, and the ones mapped to the template data-path have unit execution delay. As deduced from the results of Table 4, the latency reduction (and thus the performance improvement) is even greater (approximately 44%) when the CGC data-path is compared with a template-based one that does not support partial matching.
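The effect of the timing assumption above (2 clock cycles for an operation on FPGA hardware, 1 cycle for a template instance, both counted in T_ASIC periods) can be illustrated with a critical-path computation. The graph and names are hypothetical and resource limits are ignored, so the sketch only shows how uncovered operations lengthen the latency.

```python
def critical_path_latency(preds, delay):
    """Latency in clock cycles of an acyclic graph of resources, where
    delay[r] is the execution delay of resource r in cycles and
    resources are unlimited (an assumption of this sketch)."""
    finish = {}
    def finish_time(r):
        if r not in finish:
            start = max((finish_time(p) for p in preds[r]), default=0)
            finish[r] = start + delay[r]
        return finish[r]
    return max(finish_time(r) for r in preds)

# Hypothetical chain: template t1 -> uncovered operation u -> template t2.
chain = {"t1": [], "u": ["t1"], "t2": ["u"]}

on_fpga = {"t1": 1, "u": 2, "t2": 1}   # u realized in FPGA logic
covered = {"t1": 1, "u": 1, "t2": 1}   # u covered by a unit-delay resource

print(critical_path_latency(chain, on_fpga))  # 4
print(critical_path_latency(chain, covered))  # 3
```

Each uncovered operation on the critical path adds one extra cycle under the T_FPGA = 2 ⋅ T_ASIC assumption, which is why a data-path with full coverage gains more against a template library without partial matching.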
Conclusions -Future Work
A high-performance reconfigurable coarse-grain data-path, part of a hybrid platform, has been presented in this paper. An automated methodology for mapping applications to this data-path was also developed. Important performance gains have been achieved compared with primitive and template-based data-paths. Future work considers the development of a temporal partitioning methodology, suited to the proposed data-path's features, that takes into consideration the sharing of the CGCs.
