The key to enabling widespread use of FPGAs for algorithm acceleration is to allow programmers to create efficient designs without the time-consuming hardware design process. Programmers are used to developing scientific and mathematical algorithms in high-level languages (C/C++) using floating-point data types. Although floating point is easy to use, the dynamic range it provides is not necessary in many applications; more efficient implementations can be realized using fixed-point arithmetic. While this topic has been studied previously [Han et al. 2006; Olson et al. 1999; Gaffar et al. 2004; Aamodt and Chow 1999], full automation has been lacking. We present a novel design flow for cases where FPGAs are used to offload computations from a microprocessor. Our LLVM-based algorithm inserts value profiling code into an unmodified C/C++ application to guide its automatic conversion to fixed point. This allows for fast and accurate design space exploration on a host microprocessor before any accelerators are mapped to the FPGA. Through experimental results, we demonstrate that fixed-point conversion can reduce resource usage by up to 2x-3x. Embedded RAM usage is minimized, and an Fmax 13%-22% higher than that of the original floating-point implementation is observed. In a case study, we show that a 17% reduction in logic and a 24% reduction in register usage can be realized by using our algorithm in conjunction with a High-Level Synthesis (HLS) tool.
INTRODUCTION
The use of Field Programmable Gate Arrays (FPGAs) for general-purpose algorithm acceleration remains a promising field. Numerous studies have shown that FPGAs have better performance (in speed, resource usage, or power consumption) when implementing algorithms such as image processing [Cope et al. 2005], HD-video [Fowers et al. 2012], FFTs, and linear algebra routines [Strickland and Langhammer 2008; Zhang et al. 2009; Johnson et al. 2008]. Unfortunately, FPGAs have not gained widespread acceptance in algorithm acceleration when compared to technologies such as multicore CPUs and GPUs.
There are two major reasons for this apparent discrepancy. First, there is an inherent mismatch between the capabilities of the FPGA and the traditional application model. FPGAs are not well suited to implement entire high-level applications. The large amounts of fine-grained parallelism available on the FPGA are better suited for implementing small, highly data parallel, and performance-critical computational kernels. The remainder of the sequential control and setup code is better implemented on a microprocessor. Both Altera and Xilinx have announced FPGA products with integrated ARM CPU cores [Balough 2011; Xilinx 2011] to handle applications of this nature, as shown in Figure 1. Using these devices, programmers will have access to traditional processors as well as highly parallel FPGA hardware for acceleration. In addition, design flows targeting hybrid processor-FPGA environments, such as Altera's OpenCL Compiler [Singh 2011], have also been announced to allow for design of algorithms at a higher level of abstraction than HDL.
Authors' address: D. Chen (corresponding author) and D. Singh, Altera Corporation; email: dochen@altera.com.
A second reason limiting the use of FPGAs is the dominant use of floating point in algorithm development. Floating-point numbers can represent a larger range of real values for a given number of bits than fixed-point equivalents. The number is represented using a sign bit, a fixed number of significant digits, and an exponent which scales the significand as shown by the following equation.
$$\text{value} = (-1)^{\text{sign}} \times \text{significand} \times 2^{\text{exponent}} \quad (1)$$
Floating-point representations greatly simplify programming because the developer does not have to track the location of the decimal point in the binary representation of a given value. However, hardware implementations of floating-point operations are generally much bigger and more complex than implementations of fixed-point operations. The main difference is that floating-point numbers require additional complex logic to handle the computation of the exponent. In addition, the floating-point standard has special representations to indicate values such as infinities.
Although FPGAs have improved in implementing floating-point computations [de Dinechin et al. 2008; Langhammer 2011], and even outperform GPUs in some applications [Strickland and Langhammer 2008], their greatest strength is the ability to customize a datapath exactly to the needs of an application. For example, there are numerous applications where the dynamic range of floating point is unneeded, and which can be implemented more efficiently when fixed-point representations are used [Langhammer 2011]. Although fixed point is well understood in hardware design, it is very difficult to restructure an algorithm to use fixed-point types while producing a correct and efficient implementation that minimizes quantization error.
Proposed Solution
Given the two factors described before, we propose a fully automated floating-point to fixed-point conversion framework for applications written in C for a microprocessor coupled with an FPGA accelerator. The hybrid environment allows us to use techniques that are difficult or computationally inefficient when using only an FPGA. At a high level, our flow comprises the following steps.
(1) Parse the C program in Clang [Lattner 2007]. [...] The basis for the conversion algorithm is described by Cilio and Corporaal [1999].
(5) Evaluate error in the result by running the fixed-point application on the CPU. Although many previous works show theoretical models for error in simple cases, our approach transforms the application so the actual error can be experimentally determined. This is especially important in cases like loop accumulation where running sums may accumulate quantization error.
(6) Convert the fixed-point datapaths to HDL. Our CAD tool automatically builds pipelined accelerators for regions of logic converted to fixed point.
Our focus in this work is to examine the logic savings of these accelerator blocks in comparison to full floating-point implementations. We do not cover the interface between processors and accelerator in this article.
In contrast with previous works [Han et al. 2006; Olson et al. 1999; Gaffar et al. 2004; Brown et al. 2008; Aamodt and Chow 1999] , we believe this article makes the following unique contributions.
- This is the first article to address compiler-level floating-point transformations on the Intermediate Representation (IR). The LLVM-based framework allows us to profile and transform code, and retarget this IR to different, but functionally equivalent, backends of a microprocessor and an FPGA.
- We are targeting the area of algorithm acceleration where the FPGA is coupled with a processor. This implies that there are often format conversions to and from floating point which need to be optimized. We have not seen this trade-off described previously.
- Because LLVM IR is target independent, we can easily instrument and optimize the characteristics of the application on a processor before we target the FPGA.
- Our flow does not attempt to transform the entire application to fixed point. Since some floating-point operations may be more efficient than others on an FPGA, we use a high-level resource estimation and error model to decide which portions of the application should be converted.
- We present comparisons to modern, highly optimized floating-point cores and show detailed trade-offs and application characteristics that benefit from this flow.
- We perform a case study to evaluate our algorithm within a High-Level Synthesis (HLS) tool targeting FPGAs.
The rest of this article is organized as follows. First, we review relevant background and contrast with previous work. Next, we describe our LLVM-based compilation framework for automatic fixed-point conversion of floating-point applications. Using this framework, we evaluate the resource usage, speed, and Signal-to-Noise Ratio (SNR) performance of several benchmarks implemented on a Stratix IV-530 FPGA. Finally, our case study of automatic fixed-point conversion within HLS tools is presented.
BACKGROUND
In this section, we will briefly describe the complexity of the FPGA design flow, and the importance of fixed-point conversion in FPGA design. A detailed background of the fixed-point conversion process will be provided.
Microprocessors vs. FPGAs
Traditional microprocessor-based systems are highly efficient at performing floating-point operations. They contain a fixed datapath, with highly optimized functional units. These floating-point operations are often executed serially on each processor core, which can lead to long runtimes. However, microprocessors are easy to program and are well suited to initial prototyping of algorithms. In contrast, the FPGA is a programmable computational fabric with a regular gridlike architecture, as seen in Figure 2. The primary building block on the FPGA is the Logic Element (LE) surrounded by a configurable routing network. Each LE can implement a Boolean logic function of up to k inputs. The FPGA also contains customizable RAM and DSP blocks, typically arranged in columns on the device. Designers can customize the datapath exactly as needed, and replicate the datapath as many times as the device allows. Thus, the FPGA can parallelize execution of floating-point operations to achieve speedup versus traditional microprocessors. However, the resources necessary to implement floating-point operations on an FPGA may be substantial. Although DSP blocks can perform floating-point operations efficiently, they are limited in quantity and constrained in location, which can limit the number of floating-point operations that can be implemented on one device.
The FPGA also has a different programming model than microprocessors. Designs targeted at FPGAs are described using Hardware Description Languages (HDL) such as VHDL or Verilog. These languages are meant to be used to describe digital circuits, rather than algorithms. These designs are then compiled through FPGA CAD tools to efficiently map the design onto the fabric. As FPGA devices grow in size and functionality, the compilation process can be quite lengthy, and detrimental for prototyping.
Floating Point to Fixed Point
One way of reducing resource usage on FPGAs for floating-point operations is to convert floating-point numbers to fixed-point representations. Traditional integer operations can then be performed on these fixed-point representations with a much lower resource requirement. An excellent overview of the conversion process can be found in Cilio and Corporaal [1999]. A fixed-point number consists of three parts: a sign bit, the integer portion, and the fractional portion as shown in Figure 3(b).
The number of bits representing each part, denoted by IWL (Integer Word Length) and FWL (Fractional Word Length), can vary depending on precision requirements. The total number of bits needed to represent the entire word is given by
$$WL = 1 + IWL + FWL. \quad (2)$$
The real value, v, represented in fixed point, is given by
$$v = -2^{IWL}\, x_{WL-1} + \sum_{i=0}^{WL-2} 2^{\,i - FWL}\, x_i, \quad (3)$$
where x_i is bit i of the two's complement word and x_{WL-1} is the sign bit.
The process of converting a floating-point operation to fixed point is illustrated by the example in Figure 4. To add the two floating-point numbers 2.5 and 3.125 together, each must first be converted to fixed-point format. Ignoring the bias in the exponent, and showing the implicit 1, these two numbers are written in binary as 1.01 and 1.1001, each with an exponent of 1. These numbers are multiplied by 2^FWL, where FWL is 5 for this example, and then converted to integer format. An integer addition is performed, and the result is converted to floating-point format by normalization. We divide by 2^FWL to obtain the final floating-point result, 5.625.
[Figure 4, panels (a) and (b). Legend: fadd, floating-point add; iadd, integer add.]
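The addition of 2.5 and 3.125 with FWL = 5 can be checked on a host processor in a few lines. The following is a minimal sketch, assuming Python as the host language; the helper names to_fixed and to_float are hypothetical and are not part of the flow described in this article.

```python
FWL = 5  # fractional word length used in the example above

def to_fixed(x: float, fwl: int = FWL) -> int:
    """Scale by 2^FWL and convert to an integer (truncating toward zero)."""
    return int(x * (1 << fwl))

def to_float(n: int, fwl: int = FWL) -> float:
    """Convert the integer back to floating point and divide by 2^FWL."""
    return float(n) / (1 << fwl)

a = to_fixed(2.5)      # 2.5   * 32 = 80
b = to_fixed(3.125)    # 3.125 * 32 = 100
total = a + b          # plain integer addition: 180
result = to_float(total)
print(a, b, total, result)  # 80 100 180 5.625
```

Both inputs are exactly representable with 5 fractional bits, so no quantization error appears in this particular case.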
Automatic fixed-point conversion is not trivial. First, the alignment of the decimal point during computation is crucial for correct results. For example, when we add two fixed-point numbers as integers, both operands must have the same IWL and FWL for correctness. Also, since we only have IWL bits to represent the integer portion of the number, it is easy to overflow as the result gets larger. Therefore it is necessary to explicitly keep track of the IWL bits required at every stage of the conversion. To accommodate these issues, the following rules are used for fixed-point addition and multiplication.
- Addition/Subtraction: Given the addition of two fixed-point numbers of IWL_a and IWL_b, the IWL_c of the resulting sum is
$$IWL_c = \max(IWL_a, IWL_b) + 1. \quad (4)$$
Note that the +1 term is different than the formulation given in Cilio and Corporaal [1999] as we have found this is necessary to avoid overflow.
- Multiplication: The IWL_c of the product of two signed fixed-point numbers is
$$IWL_c = IWL_a + IWL_b + 1. \quad (5)$$
The additional logic required on the FPGA includes a Floating-Point (FP) multiplier and divider for scaling operations, and barrel shifters for type conversions. This increase in logic may overwhelm any FPGA logic area savings from substituting the FP unit with an integer unit. However, we can amortize this area overhead if multiple FP operations are performed in sequence. Consider Figure 5(a), where we have 2 FP adders followed by an FP multiplier. There is no need to convert the outputs of the adders back to floating point before the data is consumed by the multiplier. Therefore, we only need to perform conversion on the inputs to the adders and on the output of the multiplier. The large cones of FP operations in many computationally intensive designs benefit from this, leading to substantial area savings. In addition, the multipliers and dividers needed for scaling are very efficient to implement on the FPGA. Because scaling is by powers of 2, only the exponent of the floating-point representation needs to be changed.
Previous Work
Several previous works attempt to optimize floating-point designs by using fixed point. In Gaffar et al. [2004], the authors propose a C++ library with specially defined types. The programmer is required to change their source code to use the library types rather than standard float and double data types. Operator overloading is then used to construct an internal representation of the expression to be evaluated. The internal graph is then used to create a Xilinx System Generator project. The authors do not describe the fixed-point conversion process, as it seems to be left to the Xilinx backend to handle. In addition, the tool is driven by error sensitivity metrics which require differentiable computations and operating points. This adds complexity to the design flow. In contrast, our flow requires no source-code changes. The use of automatic profiling in our flow also reduces design flow complexity.
In Olson et al. [1999] and Han et al. [2006], the authors describe a Matlab-based environment for converting floating-point to fixed-point realizations. This is more suitable for digital signal processing than general algorithm acceleration. The authors specifically study the optimization of the word length [Han et al. 2006] using an evolutionary algorithm, but do not detail the trade-offs in comparison to an optimized floating-point FPGA implementation.
The work in Brown et al. [2008] uses the Valgrind profiler to gather information about the range of floating-point values on x86 binaries. Valgrind is an x86-to-x86 JIT translation tool which allows instrumentation applications to be built. This information is used to reduce the size of shifters needed to implement floating-point operations on FPGAs. The main concern with this approach is that the captured values are from a binary executable which is highly optimized for an x86 target. This executable may include many code transformations. It is difficult to match the final assembly code back to the initial unoptimized compiler-intermediate representation. There is no automation shown to facilitate this process. Although this work shows good results for select benchmarks, the comparison made is to a parameterizable suite of floating-point cores [Belanovic and Leeser 2002]. This library is designed to allow research on varying floating-point bit widths; however, comparisons are not made against floating-point cores which are optimized for standard mantissa and exponent sizes defined by IEEE-754. Since normalization is only performed once per contiguous block of FP operations, a reduction of 46.5% in logic (ALUTs) and 27.8% in register count were achieved. Our work looks at a further extreme where full fixed-point operations are used.
AUTOMATIC FIXED-POINT CONVERSION
FPGAs are often used in data streaming systems. In such systems, the FPGA receives data from external devices and performs computations on-chip. These results are then sent out over external I/O interfaces as shown in Figure 6 . Many computational problems often have large datasets that must be processed and FPGA processing can be applied in a similar fashion by streaming the data from external memory banks into the FPGA fabric. The FPGA processes the data on-chip, and streams the result back to external memory for storage. The data streamed into the FPGA is often represented in a fixed point format, such as video streams containing values representing light intensity. In many cases, the application developer converts this input data into floating-point format to simplify development effort. When the floating-point algorithm is finalized, manual conversion to fixed point can be performed to create a smaller and faster implementation using the FPGA logic fabric. It is this very scenario that we address in this work. We provide a methodology for automatically converting to a fixed-point representation from high-level code written in C.
LLVM Framework
LLVM is an open-source project that provides modular and reusable components for creating compilation tools. The LLVM infrastructure is based upon a strongly typed virtual instruction set that is both target and language independent. The intermediate representation (IR) of a program can be restructured in numerous ways. For example, consider the source-code for a simple dot-product function in Algorithm 1.
First, this program is parsed by the Clang frontend which understands the syntactic and semantic meaning of the C program. After initial conversion to the LLVM virtual instruction set, a number of passes progressively optimize the IR. These include standard compiler optimizations such as dead code elimination, constant propagation, and removal of redundant instructions. After these passes, the original program is transformed into an LLVM IR shown in Algorithm 2.
ALGORITHM 2: Dot-Product IR.
define void @dotp(float* %a, float* %b) {
entry:
  %tmp8 = load float* %b, align 4
  %tmp3 = load float* %a, align 4
  %mul = fmul float %tmp3, %tmp8
  %add = fadd float %mul, 0.000000e+00
  %arrayidx.1 = getelementptr float* %a, i64 1
  %arrayidx7.1 = getelementptr float* %b, i64 1
  %tmp3.1 = load float* %arrayidx.1, align 4
  %tmp8.1 = load float* %arrayidx7.1, align 4
  %mul.1 = fmul float %tmp3.1, %tmp8.1
  %add.1 = fadd float %add, %mul.1
  %phitmp = fpext float %add.1 to double
  %1 = call @printf(..., double %phitmp)
  ret void
}
In this example, the loop has been completely unrolled. The resulting program has 4 load instructions, 2 floating-point multiplies, and 2 floating-point additions. The transformations we present work directly on this representation of the program.
The overall tool flow of automatic fixed-point conversion, shown in Figure 7, is implemented as a pass in LLVM. For each basic block in the LLVM IR, we identify the conversion set, which is a set of Floating-Point (FP) operations to convert to fixed point. The process of selecting the set of operations to convert is described in Section 3.2. The floating-point inputs of the conversion set are converted to signed-integer formats, denoted by the fptosi step, to prepare for fixed-point computation. Next, we create a fixed-point integer operator for every FP operation of the conversion set. This is discussed in Section 3.3.
After all fixed-point implementations have been created, the output must be converted back to floating point in the sitofp step. We can then replace the original outputs with the newly created outputs from fixed-point implementations. A sweep operation will then remove all the unused original instructions from the IR. The modified IR can now be used to generate code to run on the CPU, or to generate RTL code for execution on the FPGA (described in Section 3.6).
Selection of Conversion Set
Before any FP operations can be converted to fixed point, we iterate through all instructions of the basic block to identify FP clusters. First, we denote I_ma as an FP add or multiply operation. We then define an FP cluster C as a collection of floating-point multiply, add, compare, and select instructions (I) that form a connected component in the DAG represented by the LLVM IR for the basic block (BB) under consideration.
Each FP cluster maintains a list (Inp) of cluster inputs and a list (Out) of cluster outputs. Cluster inputs are instructions that produce values used by operations within the cluster. Cluster outputs are instructions within the FP cluster that produce values that are used by at least 1 instruction outside the cluster. Thus, these outputs must be visible outside of the FP cluster.
The pseudocode for the cluster selection is shown in Algorithm 3. As we look through the IR for FP operations, we also keep track of inputs and outputs to the FP operations. If an instruction is an FP operation, but its operands are not, then the operand instructions are added to the set of FP cluster inputs. If an instruction is not an FP operation or a select or compare instruction, but its operands are, then this instruction is added to the set of FP cluster outputs. This is so that we can identify the points where we need to convert between floating point and fixed point.
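The cluster definition above can be sketched as a connected-component search over one basic block. This is a minimal illustration in Python rather than the LLVM pass itself: the dictionary encoding of the basic block, the FP_OPS set, and the function name find_fp_clusters are all assumptions made for the example. The test data mirrors the dot-product IR of Algorithm 2, with the constant operand of the first fadd omitted.

```python
from collections import defaultdict

FP_OPS = {"fmul", "fadd", "fcmp", "select"}  # opcodes eligible for conversion

def find_fp_clusters(block):
    """block maps a value name to (opcode, [operand names]) for one basic block.
    Returns a list of (members, inputs, outputs) tuples, each a set of names."""
    is_fp = {name for name, (op, _) in block.items() if op in FP_OPS}
    adj = defaultdict(set)    # undirected links between FP operations
    users = defaultdict(set)  # value name -> instructions that consume it
    for name, (_, operands) in block.items():
        for src in operands:
            users[src].add(name)
            if name in is_fp and src in is_fp:
                adj[name].add(src)
                adj[src].add(name)
    clusters, seen = [], set()
    for start in block:
        if start not in is_fp or start in seen:
            continue
        members, stack = set(), [start]
        while stack:  # depth-first walk of one connected component
            node = stack.pop()
            if node not in members:
                members.add(node)
                stack.extend(adj[node] - members)
        seen |= members
        # Inputs: values consumed by the cluster but produced outside it.
        inputs = {src for m in members for src in block[m][1] if src not in members}
        # Outputs: cluster members with at least one user outside the cluster.
        outputs = {m for m in members if users[m] - members}
        clusters.append((members, inputs, outputs))
    return clusters

dotp = {
    "tmp8": ("load", ["b"]),
    "tmp3": ("load", ["a"]),
    "mul": ("fmul", ["tmp3", "tmp8"]),
    "add": ("fadd", ["mul"]),
    "tmp3.1": ("load", ["arrayidx.1"]),
    "tmp8.1": ("load", ["arrayidx7.1"]),
    "mul.1": ("fmul", ["tmp3.1", "tmp8.1"]),
    "add.1": ("fadd", ["add", "mul.1"]),
    "phitmp": ("fpext", ["add.1"]),
}
clusters = find_fp_clusters(dotp)
members, inputs, outputs = clusters[0]
```

For this block the four loads become the points where fptosi conversion is needed, and add.1 is the single point where the result must be converted back to floating point.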
The initial FP cluster set is the maximum set of instructions we can convert to fixed point. However, it may not be to our advantage to convert all possibilities to fixed point. Fixed-point conversion incurs resource overhead as shown in Figure 5. If this overhead is greater than the logic savings of using an integer operator, then there is no reason to convert these operations. To determine the best logic savings, we create a resource estimation model.
C_mul, C_add, C_sel, and C_cmp denote the floating-point multiply, add, select, and compare instructions in the cluster. The resource usage (A) after conversion to fixed point can be estimated by the following.
This equation assumes the existence of a function ResourceUsed which returns the logic resource usage of each operation. This resource usage includes both the LEs used and the number of DSP blocks needed for each operation, since either metric can be the limiting factor. This function is easily obtained by running a few small sample circuits through the FPGA CAD tool to derive the characteristics of different floating-point and fixed-point cores. We can then run the resource estimation model with various scenarios, and pick the configuration that will yield the greatest resource savings. For example, the resource savings from converting FP multiplies to fixed point may not be substantial enough to offset the logic required to implement fptosi and sitofp conversions. However, we may still obtain resource savings if the arithmetic tree is deep enough.
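The trade-off described above can be illustrated with a toy cost comparison. This is only a sketch under stated assumptions: the cost table is invented for illustration (real values would come from sample CAD-tool runs), it tracks a single logic-element count rather than LEs and DSP blocks separately, and the function names are hypothetical.

```python
# Placeholder costs in logic elements. These numbers are invented for
# illustration only and do not come from any measurement.
RESOURCE_USED = {
    "fadd": 600, "fmul": 300,      # floating-point cores
    "iadd": 32, "imul": 100,       # integer replacements
    "fptosi": 250, "sitofp": 250,  # conversions at the cluster boundary
}

def cost_floating(n_add: int, n_mul: int) -> int:
    """Cost of leaving the cluster in floating point."""
    return n_add * RESOURCE_USED["fadd"] + n_mul * RESOURCE_USED["fmul"]

def cost_fixed(n_add: int, n_mul: int, n_inputs: int, n_outputs: int) -> int:
    """Cost after conversion: integer operators plus boundary conversions."""
    ops = n_add * RESOURCE_USED["iadd"] + n_mul * RESOURCE_USED["imul"]
    conversions = (n_inputs * RESOURCE_USED["fptosi"]
                   + n_outputs * RESOURCE_USED["sitofp"])
    return ops + conversions

def should_convert(n_add: int, n_mul: int, n_inputs: int, n_outputs: int) -> bool:
    return cost_fixed(n_add, n_mul, n_inputs, n_outputs) < cost_floating(n_add, n_mul)

# A lone multiply: conversion overhead dominates, so it stays in floating point.
single_mul = should_convert(0, 1, 2, 1)
# Four multiplies feeding three adds: the overhead is amortized over the cluster.
deep_tree = should_convert(3, 4, 8, 1)
```

With these placeholder numbers the lone multiply is not worth converting, while the deeper tree is, which is the qualitative behavior the model is meant to capture.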
Fixed-Point Conversion
After the conversion set is selected, the inputs to this set are converted to fixed-point formats as shown in Figure 5. This is performed by first multiplying the input by 2^FWL, and then applying the LLVM fptosi instruction for type conversion.
Unlike floating-point operations, we must explicitly keep track of the IWL of every instruction to ensure functional correctness of the output. To do so, we compute the output IWL for every instruction within the set using Eqs. (4) and (5). When we create fixed-point implementations for each FP operation, we need to make sure the output IWLs are consistent with the original IWLs. If the IWLs are not consistent, we must insert right shift (LLVM ashr) operators in the IR. For integer multiplication, the output data width is twice the input data width. We need to first right shift the result by WL, and then truncate (LLVM trunc) the result back to WL. After all fixed-point units are created, the conversion back to floating point is done by inserting an LLVM sitofp instruction to convert the signed integer back to floating point, and an fdiv instruction to divide by 2^FWL.
This procedure is illustrated by the sample C program code of matrix multiplication shown in Algorithm 4. The original FP cluster identified from the C program is shown in Figure 8. The blue rectangles indicate the inputs and outputs of the conversion set. Figure 10 shows the resulting transformation after running our automatic conversion tool. This figure is simply a graphical representation of the LLVM IR. First, each input is multiplied by a constant value (2^FWL) through the fcmul block. It is then cast to a signed integer by the (int)(..) block. For an integer multiplier, denoted by imul, we must first truncate its inputs to WL bits, before sign-extending them to 2WL bits. This is to allow proper computation. The 2WL output is then right shifted by WL bits. For an integer adder, the IWL of the result also needs to increase by 1 to avoid overflow. Thus, shift operations are performed on the operands before computation. After all the integer operations, the result is converted back to float via the casting instruction (float(...)), which is followed by a divide operation by 2^FWL with the fcdiv block.
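The multiply step described above (full-width product, right shift by WL, truncate) can be sketched as follows. This is an illustration only: WL = 16 is an arbitrary choice, the helper names are hypothetical, and the product IWL of IWL_a + IWL_b + 1 is an assumption of the sketch, chosen because it is the value that a right shift by WL implies when both operands are WL bits wide.

```python
WL = 16  # total word length including the sign bit (an assumption of this sketch)

def fwl(iwl: int) -> int:
    """Fractional bits left once the sign bit and IWL integer bits are taken."""
    return WL - 1 - iwl

def fptosi(x: float, iwl: int) -> int:
    """Scale by 2^FWL, then convert to a signed integer (truncating)."""
    return int(x * (1 << fwl(iwl)))

def sitofp(n: int, iwl: int) -> float:
    """Convert to floating point, then divide by 2^FWL."""
    return float(n) / (1 << fwl(iwl))

def fixed_mul(a: int, iwl_a: int, b: int, iwl_b: int):
    wide = a * b          # stands in for the 2*WL-bit product
    narrow = wide >> WL   # arithmetic right shift by WL; the result fits in WL bits
    return narrow, iwl_a + iwl_b + 1

a_fx = fptosi(2.5, 2)     # 2.5   * 2^13 = 20480
b_fx = fptosi(3.125, 3)   # 3.125 * 2^12 = 12800
p_fx, p_iwl = fixed_mul(a_fx, 2, b_fx, 3)
product = sitofp(p_fx, p_iwl)
print(p_fx, p_iwl, product)  # 4000 6 7.8125
```

The two operands deliberately use different IWLs to show that no alignment is needed before a multiply; only the bookkeeping of the output IWL changes.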
Note that in Figure 10, there are a number of operations that shift data to the right (denoted by >>). The reason why we need these shift operations is explained in greater detail in Figure 11. Suppose two numbers with different IWL and FWL are to be added together. If added as is, the result will be incorrect. These right shift operations are necessary to align the decimal point of the number formats. Note that the IWL of the sum has increased by 1 according to Eq. (4).
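The alignment step can be sketched in a few lines. This is a minimal illustration, assuming a 16-bit word and a hypothetical helper fixed_add; each operand is shifted right by the difference between the output IWL and its own IWL, so that both share the output format before the integer add.

```python
WL = 16  # total word length including the sign bit (an assumption of this sketch)

def fixed_add(a: int, iwl_a: int, b: int, iwl_b: int):
    """Align both operands to the output format, then add as integers."""
    iwl_c = max(iwl_a, iwl_b) + 1       # output IWL grows by 1 to avoid overflow
    a_aligned = a >> (iwl_c - iwl_a)    # drop fractional bits to match FWL_c
    b_aligned = b >> (iwl_c - iwl_b)
    return a_aligned + b_aligned, iwl_c

a_fx, a_iwl = 20480, 2   # 2.5   with FWL = 16 - 1 - 2 = 13
b_fx, b_iwl = 12800, 3   # 3.125 with FWL = 16 - 1 - 3 = 12
s_fx, s_iwl = fixed_add(a_fx, a_iwl, b_fx, b_iwl)
s_fwl = WL - 1 - s_iwl   # 11 fractional bits remain
total = s_fx / (1 << s_fwl)
print(s_fx, s_iwl, total)  # 11520 4 5.625
```

Adding 20480 and 12800 directly would mix values scaled by 2^13 and 2^12 and give a meaningless integer, which is the failure case the shifts prevent.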
As FP units cascade, the output IWL can increase very quickly. If the required IWL exceeds the total bit width (WL), we may not be able to perform fixed-point conversion on the entire floating-point cluster due to the loss of precision. To counteract this, our tool allows for configurable integer widths of up to 64 bits in the fixed-point implementation.
Arithmetic Tree Balancing
The maximum number of operators in any given path in Figure 8 is 4. This is an example of a vine-like code structure that can unnecessarily increase the maximum IWL required. If the code is restructured into the form shown in Figure 9, we can reduce the maximum path length in the circuit to 3. This both reduces the latency in the system and gives better SNR, since our worst-case IWL is reduced.
To do this, we created another LLVM compiler pass which attempts to balance arithmetic trees by taking advantage of the associative and commutative properties of multiplies and adds. The goal of the balancing is to minimize the worst-case IWL. Note that this type of optimization is not performed by LLVM automatically since IEEE-754-compliant FP operators are not associative: reordering the operations can change the least significant digits of the result. However, our fixed-point conversion pass already introduces some quantization noise, so any changes due to reordering are negligible.
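The effect of balancing on the worst-case IWL can be seen with a small model of the expression tree. The sketch below is illustrative only: trees are nested tuples, leaves are input IWLs, the add rule is max(IWL_a, IWL_b) + 1, and the multiply rule IWL_a + IWL_b + 1 is an assumption of the sketch. The simple pairwise balance_sum routine is a stand-in for the pass described above, not its actual algorithm.

```python
def iwl(node):
    """Output IWL of an expression tree; a leaf is the IWL of an input."""
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = iwl(left), iwl(right)
    return max(a, b) + 1 if op == "add" else a + b + 1

def depth(node):
    """Number of operators on the longest path from a leaf to the root."""
    if isinstance(node, int):
        return 0
    _, left, right = node
    return 1 + max(depth(left), depth(right))

def balance_sum(terms):
    """Rebuild a sum by pairing terms level by level."""
    level = list(terms)
    while len(level) > 1:
        paired = [("add", level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd term carries over to the next level
        level = paired
    return level[0]

products = [("mul", 3, 3)] * 4  # four products of inputs with IWL = 3

vine = products[0]              # ((p0 + p1) + p2) + p3
for p in products[1:]:
    vine = ("add", vine, p)

balanced = balance_sum(products)  # (p0 + p1) + (p2 + p3)
print(depth(vine), iwl(vine))          # 4 10
print(depth(balanced), iwl(balanced))  # 3 9
```

The vine has a longest path of 4 operators and the balanced tree 3, matching the path lengths quoted for Figures 8 and 9, and the balanced form needs one fewer integer bit at the output.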
Profile-Guided Optimizations
So far in the IWL computation, we have assumed a default IWL value is selected for FP cluster inputs. However, static selection can be dangerous if data values exceed the representable range of the IWL. Consider a floating-point number, represented by sign, mantissa, and exp fields, which is to be converted into fixed-point representation. This bit vector can then be shifted to provide IWL integer bits using the following.
Finally, the sign bit must be applied: the two's complement operator is applied to Fixed if the sign bit is 1. Note that the floating-point value will overflow IWL bits if the value of (IWL − 1 − (exp − bias)) is less than 0. Our fptosi core checks this condition and produces a signal which indicates that the floating-point value is outside the expected range. Our general approach is to signal an interrupt to the microprocessor whenever such an exception occurs. The processor can instead run this portion of code in software to ensure functional correctness. However, we would like to provide automation so that this situation is unlikely to ever occur. Thus we provide a methodology, based on profiling, for guiding the selection of the fixed-point parameters for the cluster inputs. It is extremely useful if some sample data can be obtained so that the minimum and maximum data ranges can be determined. We can then choose the IWL such that the entire range can be correctly represented. This does not guarantee functional correctness since it is possible that data can fall outside this range. However, as discussed earlier, we can detect these situations. By using representative sample data, we minimize the likelihood of data outside this representable range. To accomplish this, we have created another LLVM pass which inserts value profiling instructions directly into the IR. This is best illustrated using our initial dot-product example. After our value profiling pass is run, the IR, shown in Algorithm 5, contains the following instructions.
Our pass first inserts a global array named OptFloatRanges which has two entries per floating-point variable in the original program, to track the minimum and maximum values seen. The IR has code inserted to track the minimum and maximum values of each variable using fcmp and select instructions. The IR in Algorithm 5 shows the instructions required to track the value of the %tmp8 variable. Before the program exits, we have inserted code (not shown) to dump the value of the OptFloatRanges array to a file. These values can be used by our fixed-point conversion pass to determine appropriate IWLs for the FP cluster inputs. A value of the IWL which leads to representable fixed-point values that completely enclose the [min, max] range is selected.
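The profiling and selection steps can be mimicked in software as follows. This is a minimal sketch, not the instrumentation emitted by the pass: the RangeProfile class and select_iwl function are hypothetical names, and the choice of the smallest enclosing IWL assumes a two's complement range of [-2^IWL, 2^IWL).

```python
import math

class RangeProfile:
    """Tracks the minimum and maximum values seen for one variable."""
    def __init__(self):
        self.lo = math.inf
        self.hi = -math.inf

    def observe(self, x: float) -> float:
        # Mirrors the compare-and-select pattern: keep the smaller / larger value.
        self.lo = x if x < self.lo else self.lo
        self.hi = x if x > self.hi else self.hi
        return x  # pass the value through unchanged

def select_iwl(lo: float, hi: float) -> int:
    """Smallest IWL whose range [-2^IWL, 2^IWL) encloses [lo, hi]."""
    iwl = 0
    while not (-(2 ** iwl) <= lo and hi < 2 ** iwl):
        iwl += 1
    return iwl

profile = RangeProfile()
for sample in [0.5, -3.0, 6.25, 1.0]:  # stand-in for a representative data set
    profile.observe(sample)

chosen = select_iwl(profile.lo, profile.hi)
print(profile.lo, profile.hi, chosen)  # -3.0 6.25 3
```

A profiled maximum of exactly 8.0 would need IWL = 4, since 2^3 itself is not representable with three integer bits.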
The method of profiling value ranges is similar to traditional compiler optimizations involving feedback-directed compilation [Smith 2000]. These techniques use program transformations to attempt to optimize the runtime of the most common use cases for a particular program. Representative data inputs must be used to allow the compiler to select appropriate transformations. If the dataset is not representative of the true workload, then the program will still work correctly but may run slower than predicted.
In a similar manner, the use of value profiling allows for the creation of a highly efficient fixed-point accelerator on the FPGA. If the values of the true datasets fall outside the range that is able to be represented by the accelerator circuit, then this situation is detected and the computation is performed in software. In this case, the application will not be able to take advantage of the capabilities of the FPGA for acceleration.
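The range check and software fallback can be sketched as below. This is an illustration under stated assumptions: inputs are IEEE-754 single-precision values, the overflow test is the condition (IWL − 1 − (exp − bias)) < 0 quoted earlier, and the names exponent_field, out_of_range, and run_kernel are hypothetical. In the hardware flow the check raises an interrupt; here it is modeled as an ordinary branch.

```python
import struct

BIAS = 127  # exponent bias of IEEE-754 single precision

def exponent_field(x: float) -> int:
    """Return the 8-bit biased exponent of x stored as a single-precision float.
    Assumes x is within single-precision range."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF

def out_of_range(x: float, iwl: int) -> bool:
    """True when (IWL - 1 - (exp - bias)) < 0, i.e. x needs more than IWL integer bits."""
    return iwl - 1 - (exponent_field(x) - BIAS) < 0

def run_kernel(inputs, iwl, accelerated, software):
    """Use the fixed-point path unless any input falls outside the profiled range."""
    if any(out_of_range(x, iwl) for x in inputs):
        return software(inputs)
    return accelerated(inputs)

in_range = run_kernel([1.0, 6.25], 3, lambda v: "accelerator", lambda v: "software")
fallback = run_kernel([1.0, 9.0], 3, lambda v: "accelerator", lambda v: "software")
print(in_range, fallback)  # accelerator software
```

Because the test looks only at the exponent field, infinities and NaNs (exponent field 255) are also routed to the software path.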
The use of representative benchmark examples and experimental evaluation of error characteristics is common in the fields of audio, video, and image processing [Winkler ...].
... implementation of the LLVM netlist, we implement each edge in the graph with 3 components: data, valid, and stall. The data edge is the operand value. A valid bit signals the data consumer when the data is valid. A stall signal is sent back from the consumer to the data producer if it cannot process the incoming data. When the stall signal is asserted, the stall propagates up the chain of blocks, thus halting the pipeline.
To ensure the resulting datapath can run free of stalls in the steady state, we adopt a simple ASAP (As Soon As Possible) clock cycle scheduling algorithm, as shown in Figure 13. Since every operation has an associated latency, we can compute the minimum allowable latency required at every stage. For example, in our framework, an fmul has a latency of 5 cycles, and an fadd has a latency of 12 cycles. The fmul block is implemented using hard Digital Signal Processing (DSP) blocks, which are very area efficient. In contrast, the fadd block is implemented entirely in soft logic. To ensure a high operating frequency, circuits consisting of soft logic (FPGA logic elements) must be pipelined to a greater degree than those dominated by hard blocks. This leads to a significant difference in the latency of the floating-point adder and multiplier. Given these latencies, we can compute the minimum number of cycles before data is available at the output of a given block. This is shown on the right of the graph in Figure 13. For this example, it takes 41 cycles to generate the result at the output. However, because paths through the graph do not have equal latencies, data may arrive at a block before it can be consumed. Thus, registers are inserted to add delay to some paths so that the pipeline latencies are completely balanced. In this example, a 12-stage and a 24-stage register path are inserted into the graph to fully balance the pipeline. If the number of register stages is large, we automatically translate these into shift register implementations using block RAMs.
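The ASAP scheduling and path balancing steps can be sketched as follows. The function name asap_schedule and the example graph are assumptions: the graph is a hypothetical four-term sum of products evaluated as a serial adder chain, chosen because it gives the same totals quoted above, and the actual graph in Figure 13 may differ. Only the fmul and fadd latencies are taken from the text.

```python
LATENCY = {"fmul": 5, "fadd": 12}  # cycles, as quoted in the text


def asap_schedule(ops, edges):
    """ops: {name: kind}, listed in topological order.
    edges: list of (producer, consumer) pairs.
    Returns (ready, delays): ready[n] is the cycle at which n's output is
    available; delays[(p, c)] is the number of balancing register stages
    needed on the edge from p to c."""
    preds = {name: [] for name in ops}
    for producer, consumer in edges:
        preds[consumer].append(producer)
    start, ready = {}, {}
    for name, kind in ops.items():
        # An operation starts as soon as its slowest operand is available.
        start[name] = max((ready[p] for p in preds[name]), default=0)
        ready[name] = start[name] + LATENCY[kind]
    # Faster paths are padded so all operands arrive on the same cycle.
    delays = {(p, c): start[c] - ready[p] for p, c in edges}
    return ready, delays


# Hypothetical graph: ((m1 + m2) + m3) + m4, each m being a product.
ops = {"m1": "fmul", "m2": "fmul", "m3": "fmul", "m4": "fmul",
       "s1": "fadd", "s2": "fadd", "s3": "fadd"}
edges = [("m1", "s1"), ("m2", "s1"), ("s1", "s2"), ("m3", "s2"),
         ("s2", "s3"), ("m4", "s3")]
ready, delays = asap_schedule(ops, edges)
```

For this graph the final adder output becomes available at cycle 41, and the only nonzero balancing delays are 12 stages on the m3 path and 24 stages on the m4 path.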
RESULTS
To evaluate our automatic fixed-point conversion tool, we used a set of 6 benchmarks written in C. This set represents standard floating-point applications commonly used to evaluate algorithm accelerator devices. These designs include the following.
The data inputs for these benchmarks are randomly generated single-precision floating-point values. After running through the default LLVM flow targeting a processor, the output is saved as a golden reference. Next, we run these benchmarks using our automatic fixed-point conversion tool, which instruments the code to analyze the range of the input data. While still targeting the processor backend, we quickly obtain results for different fixed-point bit widths, ranging from 16 bits to 40 bits. We can then compare the SNR of our tool at different fixed-point widths, as shown in Figure 14. As expected, the smallest bit widths show the most error in the output results, with SNR ranging from 0dB (sad) to 34dB (mm16). As bit widths increase, the SNR improves until plateauing at approximately 70dB. For most benchmarks, a total bit width of 32 is sufficient to achieve an SNR above 60dB. However, for some designs that have more levels in the circuit, a total bit width of 36 is required. The choice of the total number of bits needed to satisfy an error criterion is guided by user trials running the application on the processor.
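The comparison against a golden reference can be sketched as below. Several points are assumptions: the text does not define its SNR formula, so the usual ratio of signal power to error power in decibels is used; the quantize helper rounds to nearest and saturates, which may differ from the tool's behavior; and only the input values are quantized here, whereas the evaluation above compares full benchmark outputs.

```python
import math
import random


def quantize(x, iwl, total_bits):
    """Round x to a signed fixed-point value with `iwl` integer bits (sign
    included) and total_bits - iwl fractional bits, saturating on overflow."""
    scale = 2 ** (total_bits - iwl)
    raw = round(x * scale)
    max_raw = 2 ** (total_bits - 1) - 1
    min_raw = -(2 ** (total_bits - 1))
    raw = max(min_raw, min(max_raw, raw))
    return raw / scale


def snr_db(reference, test):
    """10 * log10(signal power / error power); infinite if there is no error."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


# Usage: random reference data in [-4, 4], quantized at several total widths.
random.seed(0)
reference = [random.uniform(-4.0, 4.0) for _ in range(1000)]
snr_by_width = {
    bits: snr_db(reference, [quantize(x, 3, bits) for x in reference])
    for bits in (16, 24, 32)
}
```

Sweeping the total bit width in this way is cheap on a host processor, which is the property the flow relies on for design space exploration.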
Once the appropriate bit width is chosen, the FPGA backend can generate Verilog accelerator blocks to implement customized datapaths for the clusters. To demonstrate the trade-offs, we generated HDL representations of all benchmarks for all bit widths tested. Verilog accelerators were created both before and after fixed-point conversion. The generated designs are compiled through Altera's Quartus II v11.0 [Altera 2010] to synthesize, place, and route each benchmark on a Stratix IV-530 device. From Quartus II, we can extract the total logic and registers used to implement each benchmark and the maximum operating frequency (F max) of the final placed and routed circuit. The results of the original floating-point implementation are shown in Table I. These results are created using Altera's highly optimized floating-point IP cores and serve as our point of reference.
The logic utilization of each benchmark is shown in Figure 15, normalized to the number of combinational ALUTs used by the original floating-point core. From the graph, we can see that the fixed-point implementation of each benchmark substantially reduces the amount of logic required. On average, the fixed-point version is 38% (at 16 bits) to 63% of the size of the original circuit (1.6x-2.6x smaller). Figure 16 shows the total number of registers required by each benchmark, also normalized to the original floating-point core. The number of registers used drops to between 33% and 50% of the original register count (2x-3x smaller). In both logic and register usage, the 16-bit core is the smallest, and the resource usage increases linearly with the total bit width.
Logic savings are heavily dependent on the topology of the benchmark. Designs with huge cones of FP units (fft) achieve the best resource savings. We see the least reduction for designs with many FP cluster inputs, but without a deep cone of FP units (cholesky).
Our conversion tool does not change the number of DSP blocks used because DSPs in 36 × 36 multiply mode are used for imuls. On the other hand, the RAM bits used in the original core are completely unnecessary in the fixed-point version. This is due to the dramatically reduced latency of the fixed-point cores (1 cycle) in comparison to the floating-point cores (5-12 cycles). In many cases, the pipeline balancing registers are so long in the floating-point case that it is necessary to implement them using embedded RAMs.
Since fixed-point circuits use fewer logic resources, they present simpler placement and routing problems. Thus, a higher operating frequency should be achievable. To illustrate, we compare the F max of the fixed-point benchmarks with the original designs in Figure 17. To minimize the variation in placement due to the CAD tool, a sweep of 10 seeds was run, and the geometric mean of the F max over the 10 compiles was taken. The F max of each fixed-point circuit is then normalized to the F max of the original floating-point circuit. In most cases, the F max of the fixed-point circuit is higher than that of the original circuit. On average, the F max of the fixed-point circuit is 13%-22% higher than the original core. There is no clear relationship between F max and the fixed-point bit width, although there seems to be a general downward trend in F max as total bit width increases. The slight dip in F max at a bit width of 32 may be the result of a CAD anomaly.
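The seed-sweep averaging and normalization can be written out as a short calculation. The helper names are hypothetical, and applying the same geometric mean to the floating-point baseline is an assumption, since the text only states that the baseline F max is used for normalization. Any frequencies passed in are the caller's own measurements; none are taken from the paper.

```python
import math


def geometric_mean(values):
    """Geometric mean computed in the log domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_fmax(fixed_fmax_seeds, float_fmax_seeds):
    """Ratio of the seed-averaged fixed-point F max to the seed-averaged
    floating-point F max; values above 1.0 mean the fixed-point circuit
    is faster."""
    return geometric_mean(fixed_fmax_seeds) / geometric_mean(float_fmax_seeds)
```

The geometric mean is a natural choice for quantities that are later compared as ratios, because the ratio of two geometric means equals the geometric mean of the per-seed ratios when the seeds are paired.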
Although we show good results from the presented techniques, the limiting factor in achieving greater gains is the conversion functions. For each input and output of the FP cluster, 220 ALUTs and 186 ALUTs are required for the fptosi and the sitofp instructions, respectively. For a number of applications, the programmer may use casting operations in C to convert from integer types to floating-point types. In the future, we plan to extend our clustering algorithm to traverse backwards to these points so that the integer-to-float and the subsequent float-to-integer conversions cancel.
CASE STUDY: AUTOMATIC FIXED-POINT CONVERSION WITH HLS
Thus far in our discussion of automatic fixed-point conversion, we have focused the analysis on standard benchmark circuits. We automatically convert these floating-point benchmarks to fixed point and compare their performance against the original implementations. To further illustrate the significance of our technique, we present a case study where automatic fixed-point conversion is used in conjunction with a High-Level Synthesis (HLS) tool targeting FPGAs. First, the HLS tool, Altera's OpenCL Compiler, will be presented. We then discuss the results of using automatic fixed-point conversion within this tool.
OpenCL and Altera's OpenCL Compiler (ACL)
OpenCL is a platform-independent standard where data parallelism is explicitly specified. It is based on C and contains extensions to specify parallelism and memory hierarchy. The application, shown in Figure 18 , is composed of two sections: the host program and the kernel. The host program is the serial portion of the application and is responsible for managing data and controlling the overall flow of the algorithm. The kernel program is the highly parallel part of the application to be accelerated on a device such as a GPU or an FPGA.
At a high level, Altera's OpenCL Compiler [Singh 2011] translates an OpenCL kernel to hardware by creating a circuit which implements each operation. These circuits are wired together to mimic the flow of data in the kernel. For example, consider the simple vector addition kernel shown in Figure 19. This kernel describes a simple C function where many copies of this function conceptually run in parallel. Each parallel thread is associated with an ID (get_global_id(0)) that indicates the subset of data that each thread operates on. For brevity, we do not include a detailed discussion of the OpenCL programming model. The interested reader should refer to Khronos OpenCL Working Group [2008].
The translation to hardware will result in the high-level circuit structure shown on the right side of Figure 19. The loads from arrays A and B are converted into load units, which are small circuits responsible for issuing addresses to external memory and processing the returned data. The two returned values are fed directly into an adder unit responsible for calculating the floating-point addition of these two values. Finally, the result of the adder is wired directly to a store unit that writes the sum back to external memory.
The most important concept behind the OpenCL-to-FPGA compiler is the notion of pipeline parallelism. For simplicity, assume the compiler has created 3 pipeline stages for the kernel, as shown in Figure 20. On the first clock cycle, thread 0 is clocked into the two load units. This indicates that they should begin fetching the first elements of data from arrays A and B. On the second clock cycle, thread 1 is clocked in at the same time thread 0 has completed its read from memory and stored the results in the registers following the load units. On cycle 3, thread 2 is clocked in, thread 1 captures its returned data, and thread 0 stores the sum of the two values that it loaded. It is evident that in the steady state, all parts of the pipeline are active, with each stage processing a different thread.
Figure 21 shows a high-level representation of a complete OpenCL system containing multiple kernel pipelines and circuitry connecting these pipelines to off-chip data interfaces. In addition to the kernel pipeline, ACL creates interfaces to external and internal memory. The load and store units for each pipeline are connected to external memory via a global interconnect structure that arbitrates multiple requests to a group of DDR DIMMs. Similarly, OpenCL local memory accesses are connected through a specialized interconnect structure to on-chip M9K RAMs.
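The cycle-by-cycle behavior of the three-stage vector addition pipeline described for Figure 20 can be reproduced with a small trace generator. This is a sketch of an ideal pipeline with no stalls; the function name pipeline_trace and the stage labels are hypothetical and are not names used by ACL.

```python
# Hypothetical stage labels for the three-stage vector addition example.
STAGES = ["load", "capture", "add_store"]


def pipeline_trace(num_threads, stages=STAGES):
    """Return {cycle: {stage: thread_id}} for an ideal stall-free pipeline
    that accepts one new thread per clock cycle (cycles numbered from 1)."""
    trace = {}
    for thread in range(num_threads):
        for depth, stage in enumerate(stages):
            # Thread t enters stage 0 on cycle t + 1 and advances every cycle.
            cycle = thread + depth + 1
            trace.setdefault(cycle, {})[stage] = thread
    return trace


trace = pipeline_trace(4)
```

Cycle 3 is the first cycle in which all three stages are busy, each holding a different thread. With N threads, the last result leaves the pipeline on cycle N + 2, so the fill latency is amortized over the whole run.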
Finite Differences in Fixed Point
To demonstrate our techniques on a complete example, we consider the 3D Finite Difference computation described in Micikevicius [2009]. The key computation in this application is a 3D convolution shown in Figure 22. This stencil template is moved across each element in a 3D volume. Each box in the stencil is given a predetermined weight. The application of the stencil to an element in the volume is simply the weighted sum of neighboring elements.
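The weighted sum at a single element can be sketched as follows. The exact stencil shape in Figure 22 is not reproduced here, so this sketch assumes a symmetric, axis-aligned stencil in which one weight applies to the centre and one weight applies to all six neighbours at each distance. The function name stencil_point and the weights are illustrative only.

```python
def stencil_point(volume, x, y, z, weights):
    """Weighted sum of the centre element and its axis-aligned neighbours.
    weights[0] applies to the centre and weights[r] to the six neighbours
    at distance r (a symmetric stencil, assumed for this sketch).
    The caller must keep (x, y, z) at least len(weights) - 1 elements
    away from every face of the volume."""
    radius = len(weights) - 1
    total = weights[0] * volume[x][y][z]
    for r in range(1, radius + 1):
        total += weights[r] * (
            volume[x - r][y][z] + volume[x + r][y][z]
            + volume[x][y - r][z] + volume[x][y + r][z]
            + volume[x][y][z - r] + volume[x][y][z + r]
        )
    return total


# Usage on a 5 x 5 x 5 volume of ones with a radius-1 stencil.
ones = [[[1.0] * 5 for _ in range(5)] for _ in range(5)]
centre_value = stencil_point(ones, 2, 2, 2, [2.0, 0.5])
```

Each output element needs only multiplications by constants and additions, which is the kind of floating-point cluster the conversion pass targets.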
The source code for this application is freely downloadable from NVIDIA [2010] and written in OpenCL [Khronos OpenCL Working Group 2008]. We have incorporated the algorithm described in this article into Altera's OpenCL-to-FPGA compiler [Singh 2011; Czajkowski et al. 2012]. Altera's OpenCL compiler is also based on the LLVM compiler framework. This allows us to easily plug in our profiling and floating-to-fixed-point conversion passes and evaluate the benefits.
The results of three different variants are shown in Table II. First, Altera's default OpenCL flow is used, where standard IEEE-compliant floating-point cores are used in a pipelined fashion just as described in Section 5.1. Second, we use the OpenCL compiler with the fused datapath techniques published in Langhammer [2011]. Finally, we show the results of our fixed-point conversion algorithms.
The use of these techniques results in a 17% savings in ALUTs and a 24% savings in registers. We feel that these ratios are representative of complete applications, as circuits require more than just the core logic to perform computation. There is also a significant amount of circuitry required to marshal data efficiently from external sources such as SDRAM to the accelerators. The runtime results are reported for a volume of 384 × 384 × 384. Note that the highly pipelined implementation on an FPGA performs the stencil computation across the entire volume in 0.08s in all cases. In contrast, the floating-point algorithm running on a single CPU core (Intel Core2 Q9950 at 2.83 GHz) completed in 21s. The fixed-point implementation on the CPU did not yield much runtime savings, since it still took 20.6s. Our fixed-point FPGA implementation therefore represents a 257x-262x speedup over the CPU runtime. We also compared the results from the fixed-point implementation with the original floating-point cores and determined that our fixed-point implementation yields a signal-to-noise ratio (SNR) of 125 dB. This is well within the required tolerance for this application.
CONCLUSION
We have implemented an automatic, profile-guided, fixed-point conversion tool in LLVM which converts floating-point operations to fixed point to reduce the resources required to implement these computations in customized hardware. This tool offers programmers the opportunity to prototype and perform design space exploration on the microprocessor while creating an efficient FPGA design for algorithm acceleration. This alleviates the long turnaround times often experienced with large FPGA designs. Through our benchmarks, we demonstrate that fixed-point conversion can yield 2x-3x reductions in the logic and registers used. We also minimize embedded RAM usage and achieve an average of 13%-22% higher F max than the original floating-point implementation. Our flow uses a novel technique of adding instrumentation code directly into the compiler IR and using that to guide fixed-point code transformations. The transformed IR can be executed directly on a standard microprocessor to evaluate various trade-offs. We have demonstrated that high SNR can be achieved by performing automatic data range analysis and varying total fixed-point bit widths. All steps are completely automatic; we present a set of SNR, area, and operating frequency trade-offs, allowing the designer to choose the operating point that best meets her requirements.
To illustrate the effectiveness of our conversion tool, we incorporated it into a High-Level Synthesis (HLS) tool that targets FPGAs. In our case study using a 3D finite differences algorithm, we show that our methods reduced logic and register usage by 17% and 24%, respectively. This incurred no performance penalty, and the SNR remained very high at 125 dB.
An interesting future research area would be to combine the techniques shown in this article with Altera's Fused Datapath Compiler [Langhammer 2011]. The use of value profiling along with reduced dynamic range floating point (a variable-sized fixed-point mantissa along with a small dynamic exponent) may lead to even more substantial savings.
