Embedded CPUs typically use much less power than desktop or server CPUs but provide limited or no support for floating-point arithmetic. Hybrid reconfigurable CPUs combine fixed and reconfigurable computing fabrics to better balance execution performance and power consumption. We show how a Stretch S6 hybrid reconfigurable CPU (S6) can be extended to natively support double precision floating-point arithmetic. For lower precision number formats, multiple parallel arithmetic units can be implemented. We evaluate whether the superlinear performance improvement of floating-point multiplication on reconfigurable fabrics can be exploited in the framework of a hybrid reconfigurable CPU. We provide an in-depth investigation of data paths to and from the S6 reconfigurable fabric and present peak and sustained throughput as a function of the number of wide registers used and the total operand size. We demonstrate the effect of the given interface when using a floating-point fused multiply-accumulate (FMA) SIMD unit to accelerate the LINPACK benchmark. We identify a mismatch between the size of the S6's reconfigurable fabric and the available interface bandwidth as the major bottleneck limiting performance, which makes the S6 a poor choice for scientific workloads relying on native support for floating-point arithmetic.
Introduction
Power consumption has become a major concern in scientific computing [1, 2], leading to significant interest in the use of power-efficient CPU architectures as typically found in embedded and mobile computing platforms. Embedded and mobile computing platforms, on the other hand, face a steep increase in the complexity of their native application set (like speech recognition or augmented reality on mobile phones), approaching the requirements typically found in scientific applications. One typical key requirement of scientific applications is support for efficient single or double precision floating-point arithmetic. At the intersection of these two trends, we find highly capable yet power-efficient embedded CPU architectures. Several projects rely on such specialised architectures to build power-efficient large-scale computing infrastructure [1].
Fixed computer architectures need to compromise on many architectural details to appeal to a large audience. We are interested in exploring the suitability of hybrid reconfigurable CPUs for use as power-efficient computing platforms for scientific workloads. Hybrid reconfigurable CPUs allow for selective customisation of the data path through extension instructions. In contrast to fully reconfigurable devices like FPGAs, relying for the most part on a proven static CPU architecture significantly reduces development effort. Hybrid reconfigurable CPUs thereby allow for significantly improving a given application's performance with low development overhead. The more complex the custom instruction, i.e. the higher the equivalent number of instructions from the original instruction set, the higher the achievable performance gain typically is.
Designers of hybrid reconfigurable CPUs are faced with the question of how best to combine the respective strengths of fixed instruction-set architectures and reconfigurable logic. While the peak performance of hybrid reconfigurable CPUs (subject to application requirements and suitable configuration) is mostly defined by the size of the reconfigurable fabric, the achievable sustained performance relies heavily on the (static) interface between reconfigurable fabric and fixed CPU.
Over the last decade, short-vector single-instruction multiple-data (SIMD) units have been integrated into mainstream CPUs [3, 4, 5]. This trend extends to floating-point units. As a result, most mainstream CPUs can issue either a single double precision operation or two single precision operations. Given identical issue rates, the floating-point peak performance doubles if single precision data types are used instead of double precision data types. This has led to a renewed interest in algorithms relying partly on lower precision number formats to achieve a result in some higher-precision number format [6]. Algorithms using a limited set of floating-point number formats are called mixed-precision algorithms [7]. While static FPUs typically provide a linear improvement in parallelism with decreasing precision, FPGAs provide a quadratic improvement in parallelism with decreasing precision for selected operations [8, 9]. Implementing complex numerical algorithms in reconfigurable logic, which requires specification in a hardware description language (HDL), appears both laborious and error-prone. Hybrid reconfigurable CPUs provide a means for efficient coding using a reliable software stack while potentially delivering a superlinear performance gain for lower precision arithmetic operations.
This paper aims at verifying the assumption of superlinear performance gain for lower precision arithmetic operations on the Stretch S6 hybrid reconfigurable CPU [10]. This is achieved by investigating in detail the interface between reconfigurable fabric and fixed CPU, considering typical requirements of extension instructions using single and double precision operands. The S6 was chosen because it is one of the few commercially available hybrid reconfigurable CPUs.
LINPACK is a well-known benchmark to characterise the floating-point performance of computers and is widely used both in academia and industry [11]. LINPACK does not require the use of specific number formats but demands that the solution achieve a given accuracy. We will demonstrate the effect of different interface limitations on minimum issue rate and sustained performance using the LINPACK benchmark with either double or single precision number formats. Based on this characterisation, the performance of mixed-precision solvers can be modeled.
To the best of our knowledge, this work is the first to consider different floating-point number formats in the context of reconfigurable hybrid CPUs. We specifically show how the S6's interface defines the achievable performance of double precision extension instructions and how the increased peak performance of single precision extension instructions cannot be exploited due to interface restrictions.
The remainder of this paper is organised as follows. We first summarise related work in Section 2 and present the Stretch reconfigurable hybrid CPU architecture in Section 3. We then describe our implementation of double precision and short-vector SIMD single precision fused multiply-accumulate (FMA) operations as extension instructions on the S6's instruction-set extension fabric in Section 4. Section 5 presents and discusses the performance of LINPACK on the S6 when accelerated using different extension instructions. Section 6 investigates in detail the S6's interface between the fixed Xtensa CPU core and the reconfigurable fabric. The findings are used to explain the observed LINPACK performance and to discuss architectural limitations and possible future improvements. Finally, Section 7 gives our conclusion and sketches future work.
Related Work
Several hybrid reconfigurable CPU architectures exist, among them MOLEN [12], GARP [13] and Stretch [10]. An overview can be found in [14, 15]. Most publications presenting a reconfigurable CPU include some design space exploration for applications originating from multimedia benchmarks (a favorite usage scenario for reconfigurable CPUs). We believe floating-point applications are underrepresented in design space exploration for hybrid reconfigurable CPUs, potentially leading to suboptimal interface design.
Jin et al. [16] added an IEEE-754 compliant single precision FPU to the eMIPS (extensible MIPS) microprocessor (the eMIPS exists as a soft-core design and was synthesized on a Xilinx FPGA). A single precision (SP) fused operation that integrates 25 basic operations was implemented and evaluated by executing single precision LINPACK.
Huynh et al. [17] demonstrated implementation of a double precision floating-point FMA operation within the reconfigurable fabric of the commercially available Stretch S6 hybrid reconfigurable CPU.
Ramkumar et al. [18] pointed out the problematic mismatch between the increasing number of operands consumed by reconfigurable fabrics and the limited register file bandwidth on typical RISC platforms.
Iterative refinement strategies in floating-point arithmetic were first analyzed by Moler [19] . Recently, mixed precision iterative refinement received new attention due to the availability of short-vector SIMD units in mainstream CPUs [6] . We are currently investigating arbitrary-precision iterative refinement and its performance on reconfigurable devices [9] .
Stretch S6 Hybrid Reconfigurable CPU
The Stretch S6 [10] is a hybrid reconfigurable embedded CPU which combines a fixed Tensilica Xtensa LX instruction set architecture with a dynamically (i.e. at run time) reconfigurable Stretch extension unit. In the following we describe the S6's architecture and summarise typical application development on the S6 hybrid reconfigurable CPU. Figure 1a) gives an overview of the Stretch S6 architecture. The 32-bit Xtensa LX core (blue) can run at a clock frequency of up to 300 MHz, and the programmable Instruction Set Extension Fabric (ISEF, yellow) can run at a clock frequency equal to, or 1/2 or 1/3 of, the Xtensa clock frequency. The Xtensa core is equipped with an IEEE-754 single precision floating-point unit. Double precision arithmetic has to be emulated in software. The typically used emulation relies on the gcc soft float routines (contained in libgcc and normally used by gcc when generation of floating-point instructions is disabled).
Architecture

ISEF.
The ISEF is an array of reconfigurable computational resources, memories, registers and respective interconnect, which can be used to implement user-defined extension instructions. Between the Xtensa core and the ISEF, data is transferred via 128-bit Wide Registers (WR), using a maximum of three registers for input and two registers for output. The ISEF supports full pipelining of extension instructions with up to 27 pipeline stages. The ISEF's computational resources comprise 4096 arithmetic units (AUs) for bitwise addition/subtraction and logic operations and 8192 multiply units (MUs) for bitwise multiply and shift operations. The ISEF features 64kB of embedded RAM (IRAM) which can be accessed from the Xtensa core via fast direct memory access (DMA).
There are four fundamental sources of potential performance gains when offloading computations to the ISEF [20]: (i) Instruction specialization: As extension instructions serve only a single application, they can be much more specific than general-purpose instructions; (ii) Spatial parallelism: The ISEF allows for implementation of parallel data paths, limited only by the number of available ISEF resources and the width of input and output registers; (iii) Temporal parallelism: Extension instructions can be pipelined with up to 27 stages; (iv) Embedded memory: The ISEF features multiple embedded memories providing massive bandwidth at very low latency to access look-up tables or to keep temporary data.
Byte-Streaming. The Stretch S6 provides byte-streaming load-store instructions, which allow for transferring of 1 to 16 bytes between WRs and memory while implicitly updating the memory address with an increment or decrement. The S6 CPU provides three independent load-streams (RAM to WR) and one store-stream (WR to RAM). After initialisation, streaming loads and stores take just one cycle to execute, as long as the data resides in data-cache or dual-port dataram.
Application Development
Application development for the Stretch CPU typically starts with a new or existing C or C++ program running on a sequential CPU platform [10, 20]. The code is profiled and analysed to identify the code segments, typically inner loops, which consume most of the execution time. These identified code segments are then replaced by user-defined extension instructions, implemented in the ISEF and invoked from the main program as C intrinsics. The Stretch C Compiler (SCC), which is based on gcc, maps ordinary C code into a series of instructions to run on the Xtensa processor, and synthesises Stretch C code into a bitstream for ISEF configuration. Once the user-defined extension instructions have been defined in Stretch C, they are compiled by SCC. A header file defining the intrinsics associated with the extension instructions is created and included in all ordinary C files in which the extension instructions are used. Stretch C is ANSI C with a few enhancements and limitations [20]. The enhancements include data types of parameterizable bit width and operators for packing and unpacking bits within longer words. The wide registers (WRs) form the interface between ISEF and Xtensa, holding the input to extension instructions as well as the computed result. The Stretch S6 ISA provides a variety of load/store instructions to transfer data between memory, Xtensa and WRs.
Floating-Point FMA Extension Instructions on S6
The Xtensa core is equipped with one IEEE single precision floating-point unit. Our aim is to use the S6's reconfigurable fabric to natively support arithmetic operations in any desired floating-point number format up to double precision. To maximise throughput, we want to reuse remaining logic resources for additional parallel units. In the following, we describe the implementation of an extension instruction performing IEEE-754 compliant (with the support of round-to-nearest and the exception of denormalised numbers) floating-point fused multiply-accumulate (FMA). FMA combines a floating-point multiplication followed by a floating-point addition. As one rounding step can be omitted, FMA provides more accurate results and higher performance compared to a sequence of floating-point multiplication and addition [21].
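To illustrate why omitting one rounding step matters, the following minimal Python sketch emulates a double precision FMA in software by evaluating the product and sum exactly with rational arithmetic and rounding only once. This is not the ISEF implementation; the function names `fma_exact` and `mul_then_add` and the chosen operands are illustrative assumptions.

```python
from fractions import Fraction


def fma_exact(x1: float, x2: float, x3: float) -> float:
    """Emulate a fused multiply-accumulate: x1*x2 + x3 with ONE rounding.

    The product and the sum are evaluated exactly as rationals; the final
    conversion to float rounds to the nearest double exactly once.
    """
    exact = Fraction(x1) * Fraction(x2) + Fraction(x3)
    return float(exact)


def mul_then_add(x1: float, x2: float, x3: float) -> float:
    """Unfused reference: the product is rounded before the addition."""
    return x1 * x2 + x3


# Operands chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 when the multiplication is rounded on its own.
x1 = 1.0 + 2.0**-30
x2 = 1.0 - 2.0**-30
x3 = -1.0

fused = fma_exact(x1, x2, x3)       # keeps the low-order term -2**-60
unfused = mul_then_add(x1, x2, x3)  # the low-order term is lost
print(fused, unfused)
```

In this example the unfused sequence cancels to zero, while the single-rounding result retains the small residual term, which is the accuracy effect the FMA is expected to provide.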
Floating-Point Arithmetic. Floating-point arithmetic is the standard approach for approximating real number arithmetic in modern computers [22]. Compared to fixed-point numbers, floating-point numbers cover a larger dynamic range using fewer bits at the cost of non-uniform resolution. Floating-point numbers have a bounded relative error while fixed-point numbers have a bounded absolute error. The advantages of floating-point arithmetic come at the cost of a more complex hardware implementation. The IEEE-754 standard [23] specifies number formats and operations for floating-point arithmetic on digital computer systems. Many algorithms to implement floating-point addition and multiplication in digital logic have been presented [24, 22]. In this work, we follow the textbook implementations in [22].
Floating-Point Fused Multiply-Accumulate (FMA).
The fused multiply-accumulate (FMA) operation performs $Z = X_1 \cdot X_2 + X_3$ with a single rounding as an indivisible operation, thereby providing a more accurate result compared to multiplication, rounding, addition and rounding. An FMA can therefore potentially increase both the accuracy and performance of many computations involving the accumulation of products, including dot product, matrix multiplication or Newton's iteration for function approximation.
DFMA Extension Instruction. Figure 1b) depicts how the DFMA extension instruction's inputs and output are aligned in the wide registers. The DFMA extension instruction accepts three input operands from three 128-bit wide registers, and places the corresponding computed result $Z = X_1 \cdot X_2 + X_3$ in one 128-bit wide register. Since the double precision number format is 64 bits wide, only half of each wide register is used.
Figure 1c) depicts how the SFMAx extension instruction inputs and output are aligned in the wide registers. Table 1 reports the ISEF resource usage (AUs, MUs, pipeline stages, routable) when one, two, three or four SFMA operators are implemented in parallel. The ISEF's computational resources would be sufficient to implement up to four parallel SFMA operators. The ISEF routing resources, however, are exhausted with the implementation of two SFMAs, making actual implementation of three or four parallel units impossible.
Choice of $f_{\mathrm{Xtensa}}$ and $f_{\mathrm{ISEF}}$. The DFMA and SFMA extension instructions were specified using Stretch C and compiled using SCC. Ideally, the ISEF implementation would run at the S6's maximum clock frequency of 300 MHz. The compiler, however, could not meet a target frequency of 300 MHz. Given the limited ISEF resources, the FMA's critical path can only be implemented at clock frequencies of 100 MHz or less for both the Xtensa core and the ISEF. At 100 MHz, SCC was able to synthesize the DFMA and SFMAx extension instructions. However, setting the clock frequency of the S6 PCIe board to 100 MHz is currently not possible. The next lower available clock frequency is 40 MHz.
LINPACK Performance Evaluation
In the following, we will characterise the LINPACK benchmark, detail the benchmark's mapping onto the S6 hybrid reconfigurable CPU and report performance measurements for different implementation variants. We close with a discussion of the reported results.
LINPACK. The LINPACK benchmark [11] measures how fast a computer solves a dense N × N system of linear equations Ax = b. The implementation used in this work is based on a C code available on netlib relying on BLAS Level-1 routines [11]. While there exist more efficient LINPACK implementations, our aim in this work is not to maximise LINPACK performance, but to demonstrate relative LINPACK performance among various CPUs and at different precision levels when exploiting short-vector SIMD extension instructions. LINPACK uses the routine DGEFA to perform the LU decomposition of the square matrix A with partial pivoting and DGESL to solve the given system of linear equations by forward and back substitution. Most of the execution time of LINPACK is spent in DGEFA, of which the largest part is spent in the DAXPY routine. DAXPY performs y = α · x + y, i.e. it multiplies a vector x by a scalar α and accumulates the result in vector y.
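The DAXPY kernel described above can be sketched in a few lines. The following Python version is only a reference model of the operation y = α · x + y, not the netlib C code; it shows that each loop iteration consists of exactly one multiplication followed by one addition, which is the pattern a single FMA extension instruction can replace.

```python
def daxpy(alpha: float, x: list, y: list) -> list:
    """Reference model of DAXPY: y <- alpha * x + y, updated in place."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # One multiply and one add per element: two floating-point
        # operations that map onto one fused multiply-accumulate.
        y[i] = alpha * x[i] + y[i]
    return y


result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

For vectors of length n, the loop performs 2n floating-point operations, so one FMA per element corresponds to two floating-point operations per extension instruction.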
Experiment Setup.
The following experiments were set up, reflecting implementation choices available on the Stretch S6 hybrid reconfigurable CPU. For each implementation, the floating-point operators in DAXPY were replaced by function calls to the respective extension instruction.
LINPACK Code. For the double precision experiment, we replace the original sequence of DP floating-point multiplication and addition in DAXPY by a single DFMA extension instruction. For the single precision experiments, we use either the S6's SP FPU or the SFMA and 2SFMA extension instructions, respectively. The data transfer is implemented using simple load/store instructions (external memory read/write) and byte-streaming channels (internal memory to ISEF).
The code was compiled using the Stretch C compiler (SCC) version 2010.01 (built on 5 Feb 2010). The SCC flags used were -stretch-effort10 and -O3. Both Xtensa and ISEF were forced to run at a clock frequency of 40 MHz (i.e. $f_{\mathrm{Xtensa}} = f_{\mathrm{ISEF}} = 40$ MHz).
Measurements
For every experiment, we measure the total execution time in cycles. The estimated number of floating-point operations at system size N is $\frac{2}{3}N^3 + 2N^2$ [11]. The LINPACK performance in floating-point operations per second (Flop/s) is calculated by dividing the number of estimated floating-point operations by the respective LINPACK execution time. Table 2 reports the performance of DP and SP LINPACK benchmarks achieved on Stretch S6 and on desktop CPUs for systems of size N=500. The performance of DP LINPACK using software-emulated floating-point arithmetic via the Xtensa ALU is about 0.5 MFlop/s. By providing native DP floating-point arithmetic through the DFMA extension instruction, the performance achieved by DP LINPACK on S6 is 12.6 MFlop/s. This corresponds to a speed-up of about 25 times compared to DP software-emulated LINPACK. The desktop CPUs operating at much higher clock frequencies significantly outperform the Stretch S6 hybrid CPU in raw performance (MFlop/s). Accepting a lower accuracy by using SP arithmetic, the SP LINPACK implementation using the native SP Xtensa FPU achieves a sustained performance of 7.0 MFlop/s at 40 MHz. Using an extension instruction implementing one and two SP FMAs in parallel at 40 MHz, achievable LINPACK performance becomes 10.1 MFlop/s and 13.6 MFlop/s, respectively. This is about 1.5 and 2 times faster than SP LINPACK using the native SP Xtensa FPU at the same clock frequency.
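As a minimal sketch of the performance calculation described above, the following Python snippet evaluates the estimated operation count and converts a cycle count into MFlop/s. The function names and the example cycle count are illustrative assumptions, not measured values.

```python
def linpack_flops(n: int) -> int:
    """Estimated floating-point operations: 2/3*N^3 + 2*N^2 (integer part)."""
    return (2 * n**3) // 3 + 2 * n**2


def mflops(n: int, cycles: float, f_hz: float) -> float:
    """Sustained MFlop/s for a run of `cycles` clock cycles at `f_hz`."""
    seconds = cycles / f_hz
    return linpack_flops(n) / seconds / 1e6


flops_500 = linpack_flops(500)
print(flops_500)

# Hypothetical run for illustration only: one flop retired per cycle at
# 40 MHz corresponds to a sustained rate of 40 MFlop/s.
print(mflops(500, cycles=flops_500, f_hz=40e6))
```

The measured rates in Table 2 follow from the same division, using the cycle counts observed on the S6 in place of the hypothetical value.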
Columns three and four in Table 2 present the performance efficiency and the power efficiency of all LINPACK implementations on the S6 CPU. The best performance efficiency is 16% with DFMA for DP LINPACK, and 13% with SFMA for SP LINPACK. The maximum power consumption of the PCIe expansion card holding four Stretch S6 CPUs is 25W. We can currently not measure the actual power consumption and therefore assume a worst-case scenario of 25W/4=6.25W per S6 CPU. Dividing sustained performance by power consumption gives a worst-case estimate of power efficiency at N=500 of 2.0 MFlop/W and 2.2 MFlop/W for DP and SP LINPACK implementations on the S6 CPU, respectively.
Performance Measure Cycles per Extension Instruction (CPEI). In analogy to the cycles per instruction (CPI) measure used in CPU design [25], we characterise the performance of extension instructions by quoting the cycles per extension instruction (CPEI). It is equivalent to the extension instruction issue rate. The CPEI for some program or code section is calculated as the ratio of the executed Xtensa clock cycles $n_{\mathrm{Xtensa}}$ and the executed extension instructions $n_{\mathrm{ISEF}}$, i.e. $\mathrm{CPEI} = n_{\mathrm{Xtensa}} / n_{\mathrm{ISEF}}$. To compare the efficiency of LINPACK against the best possible performance achievable on the Stretch S6, we calculate the corresponding CPEI for each implementation. Our LINPACK implementation performs $n_{500} = 83833333$ Flops at a system size of N=500. Given the equivalent number of floating-point operations per extension instruction (FPEI), we can calculate the CPEI as $\#\mathrm{cycles} \cdot \mathrm{FPEI} / n_{500}$.
Table 3 shows that the CPEI using DFMA is 6.4. Using a single SFMA results in a CPEI of 7.9, while using two SFMAs gives a CPEI of 11.8. For SP LINPACK on the Xtensa FPU (no ISEF), we observe that the Xtensa FPU performs a single precision fused multiply-accumulate instruction within the routine SAXPY, with respective assembly code madd.s, thereby resulting in an FPEI of 2 and leading to a CPI (cycles per instruction) of 11.5.
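The CPEI calculation can be cross-checked against the sustained rates quoted in the text. The following Python sketch recovers an approximate cycle count from each reported MFlop/s figure at 40 MHz and applies cycles · FPEI / n_500. The FPEI values of 2 for DFMA and SFMA and 4 for 2SFMA are assumptions based on counting each FMA as two floating-point operations; because the inputs are the rounded published rates rather than the measured cycle counts, the results agree with the reported CPEI only to within rounding.

```python
N500_FLOPS = 83833333   # floating-point operations at N = 500 (from the text)
F_XTENSA_HZ = 40e6      # Xtensa and ISEF clock frequency used in the experiments


def cycles_from_rate(rate_mflops: float) -> float:
    """Approximate cycle count implied by a sustained rate in MFlop/s."""
    seconds = N500_FLOPS / (rate_mflops * 1e6)
    return seconds * F_XTENSA_HZ


def cpei(cycles: float, fpei: int) -> float:
    """Cycles per extension instruction: cycles * FPEI / n_500."""
    return cycles * fpei / N500_FLOPS


# (reported MFlop/s, assumed FPEI) per extension instruction variant
reported = {"DFMA": (12.6, 2), "SFMA": (10.1, 2), "2SFMA": (13.6, 4)}
estimates = {
    name: cpei(cycles_from_rate(rate), fpei)
    for name, (rate, fpei) in reported.items()
}
for name, value in estimates.items():
    print(f"{name}: CPEI ~ {value:.2f}")
```

The estimate also makes the trade-off visible: doubling the FPEI with two parallel SFMA units only pays off if the CPEI grows by less than a factor of two.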
Residual report. In accordance with [11], Table 4 reports the residuals of the solutions computed by all LINPACK benchmark runs; N is the system size and $\varepsilon$ is the relative machine precision. The matrix A is a random matrix generated by the LINPACK code. For reference, we report the condition number κ of A using the 2-norm: $\kappa(A) = \|A\|_2 \, \|A^{-1}\|_2$. Note that LINPACK implementations using the SFMA and DFMA extension instructions on the S6 ISEF provide more accurate results compared to implementations without fused operations due to the lower number of rounding operations.
Discussion
The S6's reconfigurable fabric is able to provide support for complex floating-point operators like fused multiply-accumulate (FMA). For double precision LINPACK, this leads to a speed-up of 25 compared to software-emulated double precision arithmetic. For single precision LINPACK, the fused operation outperforms the implementation 
S6 ISEF Interface Performance Characterisation
Section 4 outlined the FMA implementation and made naive assumptions about achievable peak throughput. LINPACK performance figures reported in Section 5 showed that the sustained performance achieved by the benchmark is significantly lower. This section details the interface between ISEF and S6 on-chip memories. We derive theoretical peak throughput from architectural features and present respective measurements. Detailed understanding of the ISEF's interface allows for a better explanation of the observed LINPACK performance as well as a detailed documentation of inherent performance degradation due to S6 architectural limitations.
Bandwidth Requirements. A key feature of reconfigurable fabrics is the fact that the complexity of some arithmetic operations increases superlinearly with the precision [8, 22, 26]. When reusing freed resources for additional parallel units, reducing precision can therefore lead to a superlinear increase in parallelism. Superlinear parallelism, however, leads to increased total required bandwidth (e.g. if a double precision unit can be replaced by four single precision units, each accepting two operands, the total required bandwidth per operand doubles from 1 × 64 bit to 4 × 32 = 128 bit, assuming an issue rate of one instruction per unit per cycle). Exploitation of the increased parallelism is subject to availability of this bandwidth. We are therefore interested in understanding all effects (both architectural and compiler-induced) that influence the achievable data transfer bandwidth to and from the ISEF.
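The bandwidth argument can be made concrete with a small calculation. The following Python sketch computes the number of operand bits that must be delivered per cycle for a given number of parallel units; the function name and the `operands_per_unit` parameter are illustrative assumptions, with the default of 1 matching the per-operand counting used in the example above.

```python
def required_bits_per_cycle(units: int, operand_bits: int,
                            operands_per_unit: int = 1,
                            issues_per_cycle: float = 1.0) -> float:
    """Operand bits that must reach the fabric per cycle to keep all units busy."""
    return units * operands_per_unit * operand_bits * issues_per_cycle


dp = required_bits_per_cycle(units=1, operand_bits=64)  # one DP unit
sp = required_bits_per_cycle(units=4, operand_bits=32)  # four SP units
print(dp, sp, sp / dp)
```

Under these assumptions, a fourfold gain in parallelism from halving the precision doubles the required operand bandwidth, which is why the interface rather than the fabric size can become the limiting factor.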
ISEF Interface. The S6's execution unit connects on-chip memory (D-Cache and DataRAM) and ISEF via 128-bit wide buses. Wide registers (WRs) act as the interface for data transfer to and from the ISEF. We are interested in understanding the implications of (i) the number of input WRs used and (ii) the size of operands used (i.e. bits used in each WR) on the achievable throughput of custom SIMD extension instructions using the streaming interface on the S6 CPU. In the following, we derive the minimum CPEI ($\mathrm{CPEI}_{\min}$) for different extension instruction configurations from architectural features and perform experiments to obtain the respective average CPEI.
S6 Minimum CPEI ($\mathrm{CPEI}_{\min}$).
The on-chip memory system and the wide register (WR) file are linked with a single 128-bit wide data bus, allowing for a load or a store of at most one 128-bit WR every clock cycle. The Stretch S6 fixed CPU design is an Xtensa LX dual-issue core whose execution unit is able to issue two instructions every clock cycle. As a consequence of the dual-issue architecture, an extension instruction and a WR load or a WR store can be issued simultaneously. Therefore, if an extension instruction consumes (or writes) only a single WR, the absolute minimum issue rate is 1. Every additional WR load or store operation increases the CPEI by one.
The $\mathrm{CPEI}_{\min}$ for all selected S6 extension instruction configurations is reported in Table 5. In summary, an S6 extension instruction reading and writing into two different wide registers can be issued every two Xtensa clock cycles, provided all data is accessible in local memory. Every additionally used register increases the $\mathrm{CPEI}_{\min}$ by one clock cycle.
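The rule stated above can be summarised as a simple model: with one 128-bit WR transfer per cycle, the minimum CPEI equals the total number of WR loads and stores, with a floor of one. The following Python sketch encodes this reading of the text; the function names are hypothetical, the model does not reproduce Table 5, and the resulting throughput figure is an upper bound under this model rather than a measured value.

```python
def cpei_min(wr_loads: int, wr_stores: int) -> int:
    """Minimum cycles per extension instruction under the one-WR-per-cycle rule."""
    return max(1, wr_loads + wr_stores)


def peak_mflops(f_mhz: float, fpei: int, wr_loads: int, wr_stores: int) -> float:
    """Upper bound on MFlop/s when the issue rate is limited by WR transfers."""
    return f_mhz * fpei / cpei_min(wr_loads, wr_stores)


print(cpei_min(1, 0))  # single WR consumed, nothing written back
print(cpei_min(1, 1))  # one input WR and one output WR
print(cpei_min(3, 1))  # FMA-style instruction: three inputs, one output

# Bound for an FMA-style instruction (2 flops) at the 40 MHz used for LINPACK.
print(peak_mflops(40.0, fpei=2, wr_loads=3, wr_stores=1))
```

Comparing such a bound with the sustained rates makes it possible to separate the loss caused by the WR interface itself from additional overhead in the surrounding loop code.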
Experiments
A small test program measures the average $\mathrm{CPEI}_{\min}$ achievable as a function of the number of input wide registers and operand size. Byte-streaming channels are used for efficient data transfer to and from the ISEF (cf. Section 3). For ease of understanding and presentation, two simple extension instructions DNEGx and SNEGx were implemented using Stretch C, performing negation for double precision and single precision floating-point operands, respectively. For each extension instruction, there exist variants reading data from the lower part of one (DNEG1, SNEG1), two (DNEG2, SNEG2) or three (DNEG3, SNEG3) wide registers. The result of the operation is always a single floating-point number. The extension instruction is executed in a loop reading data from on- or off-chip memory. For all experiments, the Xtensa core runs at 300 MHz, and the Xtensa core and ISEF were forced to run at the same clock frequency by using SCC flag -stretch-issue-rate 1, i.e. $f_{\mathrm{ISEF}} = f_{\mathrm{Xtensa}} = 300$ MHz.
On-chip memory access
For on-chip memory access, byte-streaming channels are used. For each extension instruction (i.e. using 1, 2 or 3 input WRs), the input operand size is varied between 32, 64, 96 and 128 bit. The resulting sustained CPEIs of the SIMD extension instructions using byte-streaming channels when the data reside in on-chip memories are reported in Table 6 . The measured CPEIs when data is in D-Cache and in DataRAM are almost equal. Therefore only the smaller measured CPEI is chosen and reported.
Discussion
For data transfer between on-chip memory and wide registers via byte-streaming channels, the $\mathrm{CPEI}_{\min}$ is expected to depend on the number of WRs used, but not on the size of the operands within a WR. This is confirmed by our measurements. For configurations using two and three input WRs, the measured CPEI almost matches the $\mathrm{CPEI}_{\min}$. For configurations using one input WR, the measured CPEI exceeds the expected $\mathrm{CPEI}_{\min}$ by one clock cycle.
Conclusion
Hybrid reconfigurable CPUs are prime candidates for power-efficient acceleration of demanding signal processing applications. Our goal was to investigate the extent to which this applies to the domain of scientific computing, especially floating-point arithmetic.
Benchmarking LINPACK using a floating-point FMA extension instruction showed the functional viability of using the S6 for scientific workloads, but achieved disappointing performance figures. The low figures were due to (i) a low clock frequency of 40 MHz compared to the achievable 100 MHz for the extension instruction, (ii) a maximum clock rate of 300 MHz for the Xtensa core and (iii) a low issue rate of extension instructions (cf. Table 3). This work explored the architectural limitations leading to the low issue rate.
Reconfigurable fabrics can provide superlinear parallelism when implementing short-vector SIMD units for selected arithmetic operations in reduced precision. This genuine advantage of reconfigurable logic can only be exploited in reconfigurable hybrid CPUs if the interface between reconfigurable logic and fixed CPU can provide the necessary bandwidth for data transfer. In this work we have explored the data bandwidth of the interface between reconfigurable fabric and fixed CPU of the Stretch S6 hybrid reconfigurable CPU.
We derived minimum cycles per extension instruction between 1 and 4, depending on the number of WRs used. The streaming channels work as expected, decoupling the extension instruction issue rate from the number of bits consumed in each WR. Except for one case (single input WR), our CPEI measurements confirm the expected minimum cycle values. Both minimum and measured CPEI increase with the number of WRs used.
The Stretch S6 features a large and versatile reconfigurable fabric with impressive I/O (3x128 bit in, 2x128 bit out). The surrounding infrastructure does not match these capabilities, however, limiting the overall data transfer to 128 bit per clock cycle. Fast floating-point arithmetic relies on efficient transfer of large operands. Given the S6 CPU design, multiple arithmetic units, although implementable in the ISEF, cannot be fed with the respective data, resulting in frequent stalls and inefficient program execution. Therefore, the Stretch S6 seems unsuitable for scientific workloads due to the limited bandwidth between ISEF and on-chip memories.
Outlook. The dominant issue identified in this work is the mismatch between ISEF resources and WR/Xtensa bandwidth. We will investigate whether other hybrid reconfigurable CPUs provide a better match of resources. The Stretch S7 features improved ISEF routing resources. We plan to evaluate whether our presented SIMD design scales better (i.e. more parallel units can actually be implemented) on the S7.
