ABSTRACT
In order to help designers refine their code from a simulation model to a synthesizable behavioral description, we are trying to efficiently synthesize the full ANSI C standard [23]. This task turns out to be particularly difficult because of dynamic memory allocation, function calls, recursion, gotos, type casting, and pointers. The problem with dynamic memory allocation (malloc, free) and recursion is that the size of the memory required for an application is a priori unknown. Therefore, the synthesis of C code involving dynamic memory allocation would require access to an operating system running in software or the generation of hardware allocators [44, 53]. Arbitrary control flow (e.g. due to goto statements) complicates the scheduling of operations, even though it has been addressed [49]. In general, the use of pointers is one of the major difficulties, especially when combined with pointer arithmetic and type casting. Pointers have different applications in C. They are often used in function calls to pass parameters by reference. They are also used to scan arrays, reference data structures or perform any type of complex memory management operation. The semantics of pointers in C is the address of data in main memory. However, in hardware, designers may want to optimize the memory architecture by using registers, multiple memory banks, etc. Therefore, pointers cannot be considered as addresses to a single memory. To enable efficient mapping of C code with pointers to hardware, the synthesis tool has to automatically generate the appropriate circuit to access the data referenced by pointers. The resolution of pointers is a key feature for C-based synthesis. It is an enabler for fast data accesses and efficient scheduling of operations.
In this paper we will focus on the efficient hardware implementation of pointers in C models. In Section 2, we present some of the related work on synthesis from C as well as on compilation of C code onto parallel architectures. In Section 3 and 4, we define our synthesizable subset of C and show how various types of pointers can be synthesized. In Section 5 and 6, we discuss different techniques for optimizing the code by limiting the number of live variables before the loads and stores and encoding the value of the pointers. In Section 7, we present SpC, our framework for the synthesis and optimization of C code with pointers using the SUIF compiler framework and a commercial behavioral synthesis tool. Finally, in Section 8, results are given for a set of examples.
RELATED WORK

Hardware synthesis from C/C++
Different subsets of C/C++ and C-like HDLs have been defined and used for synthesis. We mention first those developed in the eighties. HardwareC [26] is a language with a C-like syntax and cycle-based semantics. It does not support pointers, recursion or dynamic memory allocation. HardwareC can be fully synthesized.
Cones [47] from AT&T Bell Laboratories is an automated synthesis system that takes behavioral models written in a C-based language [6] and produces gate-level implementations. Here, the C model describes circuit behavior during each clock cycle of sequential logic. This subset is very restricted and contains neither unbounded loops nor pointers.
In the recent past a few projects have been looking at means to use C/C++ as an input to current design flows [12] . Constructs are added to model coarse-grain parallelism, communication and data-types. These constructs can either be defined as new syntactic constructs, hence creating a new language. They can also be implemented as part of a C++ class library [69, 56] . Even though restrictions on the language apply for synthesis, software/hardware systems are then modeled directly using C++. Simulation is performed by running the executable generated after compiling the models. Standard debugging environments can then be used to check the functionality of the system.
For reactivity, SystemC [29, 69] (formerly known as Scenic [28] from Synopsys) supports a mixed synchronous and asynchronous approach implemented as a C++ class library. The Esterel C language (ECL) [27] from Cadence is synchronous as it is based on both C and Esterel. Other extensions include Handel-C [59] and Bach C [22] originally based on Occam, SpecC [65] based on SpecChart, CynLib from CynApps [56], and C-Level Design [54].
In order to map functionality to hardware, a synthesizable-C/C++ subset is usually defined. We can distinguish two approaches. The first approach consists of translating a subset of C into HDL (Verilog or VHDL) that will eventually be synthesized using today's synthesis tools. The second approach consists of using C/C++ directly as an input to behavioral synthesis.
In order to facilitate the mapping of C models into hardware, several tools exist that automatically translate C-based descriptions into HDL either at the behavioral level or at the register transfer level (RTL). In the original Bach C compiler, a limited subset of C can be translated into VHDL at the behavioral level. CoWare [55], OCAPI [41, 62], CynApps [56] and others [57, 70] automated the translation from a refined RTL model to HDL.
These subsets don't include pointers.
Kim and Choi [24] as well as the authors of this paper [42, 44] were the first to report on the synthesis of hardware C models with pointers. Kim and Choi's implementation is limited to a rather small subset of C. Pointers that may point to multiple locations are not supported and such constructs as type casting and complex data structures are not considered. Two commercial tools, C-Level Design C2HDL [51] and Frontier Design AR|T Builder [58], also provide tools for translating C models into Verilog or VHDL. Limited scheduling and resource-sharing techniques can be applied to quickly generate RTL synthesizable code. Pointers are one of the limitations for AR|T Builder. Pointers are only supported to pass parameters by reference or to scan arrays (pointer arithmetic).
These types of pointers can usually be removed using standard compiler techniques (propagation and function inlining) and by adding ports for procedures. C2HDL, on the other hand, supports all of the ANSI C constructs excluding libraries. However, pointers are implemented in a software-like approach. They are only considered as addresses to data stored in memories, which requires the allocation of memories to store the various variables, as well as addressing units. In hardware, designers may want to optimize the locality by storing data into multiple memories, registers or even wires (e.g. output of functional units). Our tool SpC presented here enables such optimization by leveraging recent research on pointer analysis and high-level synthesis.
Another approach is to use C/C++ directly as an input to architectural synthesis tools. This approach has been chosen by Synopsys with CoCentric SystemC Compiler [19, 68] and by NEC with Cyber [49]. C and C++ are both procedural imperative languages. Their semantics rely on an implicit von Neumann architecture. The implementation of sequential functional descriptions into hardware has been studied extensively during the last decade [18, 26, 17, 25, 8, 60, 67]. Synthesis from C/C++ descriptions can leverage some of this previous work on architectural synthesis but also requires the development of some extensions for efficiently supporting the different constructs of C/C++. Some of the current work on function calls as well as synthesis of structures in VHDL can also be relevant. More research is however required for supporting C/C++ constructs such as pointers, dynamic memory allocation, and object-oriented features.
Finally, we should also mention some of the areas in which C/C++ models mix hardware-software and other specific architectures. For hardware-software codesign, the CoWare N2C system [55] as well as its precursor [5] use C/C++ as a language base for system specification. Additional constructs have been introduced to define concurrent processing blocks and communication. This description is used to synthesize the interfaces between the blocks. Cosyma [16] uses C*, another superset of C with processes and timing constraints. During hardware synthesis, functions are inlined and pointers are only treated as memory references.
For synthesis of reconfigurable systems based on field programmable gate arrays (FPGAs), several projects have been using C/C++. For PAM-Blox [33], a bottom-up methodology is presented in which a library of components can be defined and used as C++ objects to build systems for the Pamette architecture. A similar design environment has also been developed based on C for Splash [20]. For mixed software and reprogrammable FPGA architectures, the Garp compiler [7] as well as the Nimble compiler [32] automatically generate retargetable co-processors to speed up loops. Pointers are treated as references to the main memory. This approach is relevant for implementing memory-mapped I/O. However, it can limit the parallelization of data transfers inside a datapath. Finally, Babb et al. [2] present a compiler for a variation of the RAW parallel architecture in which one or multiple processing units can be replaced by specialized hardware blocks. The problem of pointers is addressed in order to map data to different memory tiles. Pointers to multiple memory locations are however a limitation as these locations are mapped to a unique memory and therefore cannot be accessed in parallel in a datapath.
To summarize the previous work, pointers are one of the main outstanding issues for the synthesis of hardware from C. In order to guarantee a good quality of results, the current practice is to support only a limited subset of the language with severe restrictions on pointers. Otherwise, a software-like approach is taken, in which the data accessed by pointers are stored in memory. Our approach is based on the use of analysis techniques ( pointer analysis ) in order to generate efficient hardware from C code using any kind of pointers at the behavioral level.
Software compilation of C and C++
C and C++ are two of the most commonly used programming languages today. Many compilers exist for many different architectures. Most of the recent compilers not only try to map the different statements of the code into assembly instructions, but they also try to optimize the code for a given instruction set architecture (ISA). For distributed architectures, parallel compilers try to partition programs into multiple threads running in parallel. However, some of the C constructs such as pointers and arbitrary control flow operations (goto, longjmp, etc.) make these optimizations difficult. In software, pointers represent addresses in memory. They are often used to pass parameters by reference, access array elements, address dynamically allocated memory and manage the memory. By definition, pointers may reference multiple data. This happens when referencing the different elements of a data structure or of an array. It may also happen inside a function for pointers corresponding to parameters passed by reference or, more generally, when the value of the pointer at one point in the code varies according to the current context or the previous flow of operations.
Many of the optimizations done in today's compilers as well as in many high-level synthesis tools are based on data-flow analysis [1, 34]. The purpose of data-flow analysis is to provide information on how a code segment manipulates its data. Examples of applications include register allocation (based on reaching-definition and live-variable analysis), constant folding, common-subexpression elimination, loop optimization, dead-code elimination, etc. The optimizations presented in Section 5 are also applications of data-flow analysis. To solve a given data-flow problem, the effect of each programming language structure is modeled by transfer functions. The result of such transfer functions often depends on the data accesses at each statement in the program. Namely, to model the effect of statements involving a pointer, it is important to know what data may be accessed by the pointer (points-to information).
In order to parallelize programs onto distributed architectures, the independent sets of data which can be processed in parallel have to be extracted [30] . The problem here is to find statements in the program that may read or write the same locations (aliasing problem). For this purpose, the aliasing information has to be determined between pointers. The points-to information and the aliasing information are equivalent and can be determined by recent analysis techniques called pointer-analysis or alias-analysis . Different pointer-analysis techniques [50, 51, 37, 46] exist. For hardware synthesis, we also need to know which variables are accessed at each statement.
Therefore, pointer analysis can be used for the behavioral synthesis of C models as we will do in the next section.
BACKGROUND: SYNTHESIS OF C MODELS WITH POINTERS
In software, a C program is targeted to a virtual architecture consisting of one memory in which all data are stored. The semantics of pointers is the address of an element in memory. Even though register declarations may allow programmers to specify the variables to place in registers, the assignment of variables to registers is generally done by the compiler. The notions of caches and memory pages are transparent to programmers.
In hardware, at the behavioral level, designers want to have control on where data are stored and want to optimize the locality of the storage. Typically, a chip design contains multiple memory banks, register files, registers and wires. To efficiently map C code onto hardware, the storage space must be partitioned. During synthesis, each partition is then mapped to a register, a wire, or a memory. Some partitions may also represent pointers. Pointers may be used to reference any variable no matter where its information is available. Pointers are then considered as references: references to memory elements, registers, wires, or ports. They can be used to access data. In this paper we call the action of reading data using a pointer a load . Subsequently, a store is the action of writing data using a pointer.
The synthesis of hardware from C consists first of partitioning the memory. Each partition is then mapped to a variable (akin to wire or register in the final implementation) or an array (akin to memory or register file). The synthesis of pointers consists of generating the appropriate circuit for accessing data. For this purpose, we change the addresses into numbers (i.e. encode pointers' values) and replace loads and stores by some assignments directly accessing the data the pointer may reference (i.e. dereference pointers). As we can see in Example 1, in order to efficiently map C code into hardware, we first need to partition the memory. In our implementation, memory is partitioned into a set of location sets as described in Section 3.2. Subsequently, to synthesize load and store operations into hardware, we need to know at compile-time the set of locations the pointers may reference (points-to information). As we have seen in Section 2.2, such information is also widely used in compilers and can be determined by recent analysis techniques called pointer-analysis or alias-analysis described in Section 3.3. Finally, in Section 3.4, we present how memory can be partitioned into variables and arrays which can be mapped to hardware.
Definition of the subset
The ultimate goal of this research is to efficiently synthesize the full ANSI C. In this work, however we target mainly the synthesis of pointers to statically allocated data and explore different optimization techniques. Extensions of this work to include more of the C syntax (malloc/free) are possible [44, 45] but beyond the scope of this paper. In this section we only talk about the restrictions on the synthesizable subset. Limitations on the generated architecture may also exist akin to the limitations of the behavioral synthesis tool used as a back-end to our tool.
Our subset contains all statements supported by today's behavioral synthesis tools including branches, loops, assignments, etc... It also contains pointers to data which can be stored in multiple memories, registers, or wires.
It supports pointers to statically allocated data, such as variables, arrays, and structures, pointers to pointers, and pointers to functions. Since memory blocks are instantiated at compile time, recursion and pointers to dynamically allocated memory whose size is unknown at compile time are not allowed. This implies that, in general, malloc, free and recursion are not supported. Nevertheless, malloc followed by free could be allowed as well as tail recursion. Calls to malloc followed by free can be treated as local variables [44] and tail recursion elimination can be done by turning recursions into loops [34].
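As a rough illustration of these two exceptions, the sketch below shows a tail-recursive function next to its loop form, and a malloc/free pair next to an equivalent local array. The function names and the fixed size of 8 elements are hypothetical, and the sketch assumes the allocated size is a compile-time constant; it is not taken from the SpC implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Tail-recursive sum of 1..n: the recursive call is the last operation. */
unsigned sum_rec(unsigned n, unsigned acc) {
    if (n == 0) return acc;
    return sum_rec(n - 1, acc + n);
}

/* Same computation after tail-recursion elimination: the call becomes
   an update of the parameters followed by another loop iteration. */
unsigned sum_loop(unsigned n, unsigned acc) {
    while (n != 0) {
        acc += n;
        n -= 1;
    }
    return acc;
}

/* malloc followed by free inside one function (size assumed constant). */
int scratch_heap(int x) {
    int *buf = malloc(8 * sizeof *buf);
    int i, s = 0;
    assert(buf != NULL);
    for (i = 0; i < 8; i++) buf[i] = x + i;
    for (i = 0; i < 8; i++) s += buf[i];
    free(buf);
    return s;
}

/* The same buffer treated as a local array of the same size. */
int scratch_local(int x) {
    int buf[8];
    int i, s = 0;
    for (i = 0; i < 8; i++) buf[i] = x + i;
    for (i = 0; i < 8; i++) s += buf[i];
    return s;
}
```

Both rewritten forms use only statically sized storage, which is what the subset requires.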
The pointer analysis techniques and the memory representation presented in the next sections support the complete ANSI C syntax. In this paper, however, we define our own synthesizable subset. Our subset includes all types of pointers and type casting. The code is assumed to be correct. Tools such as Purify [64] or LCLint [63] can be used to check that memory reads and writes are valid. Besides, we set the following restrictions.
One restriction applies to systems described as a set of parallel processes: pointers that reference data outside of the scope of a process (e.g. global variables or data internal to some other processes) are not allowed. Their resolution would require the synthesis of some kind of interface between the circuits realizing the processes. Such interface is usually defined during system partitioning and, hence, before synthesis. 
Memory representation
The simplest memory representation consists of a single address space in which all data are stored. This trivial representation, however, prevents optimizing the locality and parallelizing the code. On the other hand, the most accurate representation, which would distinguish each element of arrays or of recursive data structures, is not practical for large programs. For simple data structures (arrays, structures, arrays of structures), offsets are used to identify the different fields of structures whereas strides are used to record array-element sizes. Figure 1 gives an example of representation for an array of structures. The representation doesn't distinguish the different elements within the array but it distinguishes the different instantiations of variables and structures. This makes sense since all elements of an array are usually alike. Nested arrays and structures, type casting and pointer arithmetic make things more complicated, leading to some additional inaccuracies.
The representation of the memory itself depends on how locations are being accessed. Consequently, pointer analysis which is the subject of the next section and memory representation are tightly related.
Pointer analysis
Pointer analysis is a compiler pass to identify at compile-time the potential values of the pointers in the program. This information is used to determine the set of locations the pointer may point to. With the memory representation of Section 3.2, this set of locations is actually a set of location sets. For synthesis, in the case of loads and stores, we want to synthesize the logic to access or modify the location referenced by the pointer. For this purpose, the points-to information must be both safe and accurate: safe because we have to consider all locations the pointer may reference and accurate because the smaller the points-to set is, the less logic we have to generate. We can distinguish two types of analyses. comes from features such as dynamic memory allocation, recursion and recursive data structures that we do not consider in this paper.
The flow-and context-sensitive analysis is more appropriate for hardware synthesis. In our case, the complexity of the analysis is not an issue, and the coding style for modeling hardware leads to accurate results.
Our implementation uses a flow-and context-sensitive analysis. Using the memory representation described in the previous section, the points-to information is defined as a set of location sets. The points-to information is then used to encode the pointers' value and to generate the appropriate logic for accessing the data in each location set. 
Memory partitioning and mapping to variables and arrays
After analysis, the storage in the program can be represented as a set of distinct location sets. This set of location sets represents a partitioning of the memory. Each partition block (i.e. each location set) is ultimately mapped to a wire, a register or a section of memory in the final design. The allocation of a given variable to a register (or a wire) is typically the result of architectural synthesis. We can distinguish two types of location sets for statically allocated data: location sets whose strides are null (i.e. singletons, sets of one location), and location sets with non-zero strides (i.e. sets of multiple locations). A singleton location set may therefore be treated as a simple variable, whereas a location set with non-zero stride may be mapped to an array. In our implementation [45], for each location set <loc, f, s>, we define SPC_loc_f_s as follows.
For a singleton location set (i.e., s null), SPC_loc_f_s is a variable. In the case of a location set representing a variable of basic type (e.g. char, short, int) the mapping is straightforward. For structures, their different fields can be mapped to separate variables (akin to registers or wires in the final hardware) as long as they are represented by separate location sets.
For a location set with non-zero stride (i.e. s not null), SPC_loc_f_s is defined as an array (e.g. array of integers). Such an array may then typically be mapped either to a memory or to a register file, manually or according to current methodology [9, 36]. For arrays of structures, the different fields of the structures can be mapped to different memories as long as their representations do not overlap. This allows the different fields of the structures to be accessed independently, leading to more flexibility and potentially better performance. The partitioning process can be more complex with type casting and out-of-bound array accesses [45]. Nevertheless, after memory partitioning, the storage of the C program can be represented as a set of distinct variables and arrays. Therefore, in the rest of this paper, all data are supposed to be either variables or arrays. In the next section we present how pointers to variables and arrays can be synthesized. For clarity, variables and arrays such as SPC_a_0_0 and SPC_table_0_4[...] will be denoted a and table directly.
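The naming rule can be sketched in a few lines of C. The record layout, the field meanings (f taken as the byte offset, s as the byte stride) and the helper names below are assumptions made for illustration; only the SPC_loc_f_s naming pattern and the variable/array distinction come from the text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for a location set <loc, f, s>. */
struct location_set {
    const char *loc;  /* name of the memory block */
    unsigned    f;    /* offset within the block (assumed in bytes) */
    unsigned    s;    /* stride (assumed in bytes); 0 means singleton */
};

/* A non-zero stride maps to an array, a null stride to a variable. */
int is_array(const struct location_set *ls) {
    assert(ls != NULL);
    return ls->s != 0;
}

/* Writes "SPC_<loc>_<f>_<s>" into out and returns snprintf's result. */
int spc_name(const struct location_set *ls, char *out, size_t n) {
    assert(ls != NULL && out != NULL);
    return snprintf(out, n, "SPC_%s_%u_%u", ls->loc, ls->f, ls->s);
}
```

With these assumptions, the location set <a, 0, 0> yields the variable name SPC_a_0_0 and <table, 0, 4> yields the array name SPC_table_0_4.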
POINTER SYNTHESIS
In hardware, as discussed in Section 3, data may be stored in multiple registers, memories or even wires (e.g. output of a functional block). Therefore, to efficiently map C code into hardware, pointers may not only address data in memory, they may also reference registers, wires or ports. Pointer analysis is used to define the set of locations, as a set of location sets, each pointer may point to. Our synthesis tool generates the appropriate circuit to dynamically access these locations according to the pointers' value. We distinguish two types of pointers: pointers to a single location, which can be removed, and pointers to multiple locations.
Loads from pointers to a single location (i.e. one location set whose stride is null) are simply replaced by assignments from the location accessed. Similarly, stores are simply replaced by assignments to the location referenced. Loads and stores from pointers to multiple locations (i.e. many location sets with zero strides and/or one or more location sets with non-zero stride) are replaced by a set of assignments in which each location may be dynamically accessed according to the pointer's value. For the sake of clarity, we will use the variable name p as a generic pointer name.
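The single-location case can be pictured with a minimal before/after pair. The function names are hypothetical and the example is only a sketch of the transformation, written by hand rather than produced by a tool.

```c
#include <assert.h>

/* Before: p can only point to a, so its points-to set is one singleton. */
int with_pointer(int a) {
    int *p = &a;
    *p = *p + 1;   /* one load and one store through p */
    return a;
}

/* After: the load and the store become plain assignments on a,
   and the pointer itself disappears. */
int without_pointer(int a) {
    a = a + 1;
    return a;
}
```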
Encoding the value of the pointers
The addresses (i.e. pointers' values) are encoded. The encoded value of a pointer p consists of two fields: the tag p.tag corresponds to the location set referenced by the pointer and the index p.index stores the number of bytes corresponding to the offset of the data referenced within the location set.
The tag p.tag is only used for pointers to multiple location sets. Its size (defined as the minimum number of bits used to store its value) can be as small as $\lceil \log_2(\text{size\_of\_point-to-set}) \rceil$. The index p.index, on the other hand, is used when the pointer p may point to a location with non-zero stride (e.g. an array). Pointer arithmetic is then supported by changing the value of the index: the value of p.index is initialized when p gets the address of the array element. Then, the index is modified instead of p.
For pointer variables, these two fields can be implemented as separate variables p_tag and p_index. The encoding of the pointers' value has an effect on the complexity of the design. Example 6 gives two examples of encodings that produce different implementations for the assignment of two pointers. In Section 6, the encoding problem is formulated and a heuristic solution is presented.
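A minimal sketch of such an encoding follows, assuming a pointer that may reference either a variable a or an element of an int array table. The struct, the tag values and the helper names are hypothetical; the tag values in particular are one arbitrary choice among the possible encodings.

```c
#include <assert.h>

/* Hypothetical encoded pointer: tag selects the location set,
   index is the byte offset inside that location set. */
struct enc_ptr {
    unsigned tag;
    unsigned index;
};

enum { TAG_A = 0, TAG_TABLE = 1 };  /* arbitrary encoding, assumed here */

/* Minimum number of bits needed to distinguish n location sets. */
unsigned tag_bits(unsigned n) {
    unsigned bits = 0;
    while (n > 1) {
        bits++;
        n = n / 2 + n % 2;  /* halve, rounding up */
    }
    return bits;
}

/* p = &a */
struct enc_ptr addr_of_a(void) {
    struct enc_ptr p = {TAG_A, 0};
    return p;
}

/* p = &table[i], table being an array of int */
struct enc_ptr addr_of_table(unsigned i) {
    struct enc_ptr p = {TAG_TABLE, i * (unsigned)sizeof(int)};
    return p;
}

/* p = p + k on an int pointer: only the index changes. */
struct enc_ptr ptr_add(struct enc_ptr p, unsigned k) {
    p.index += k * (unsigned)sizeof(int);
    return p;
}
```

A pointer whose points-to set holds a single location set needs no tag at all, which is why tag_bits(1) is 0 in this sketch.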
Dereferencing the pointers
Several types of pointers can be distinguished. We have seen in Section 3.4 how complex data structures can be represented as variables and arrays. Without loss of generality, in this section, we first consider pointers that may point to variables and array elements. We then present two extensions for pointers to pointers and pointer to function.
Pointers to variables and arrays
We use the result of pointer analysis to remove loads and stores. With the assumptions of Section 3.1, loads and stores can be replaced by branching statements (e.g. case, if then else) at compile time. Pointer analysis defines the set of location sets the pointer may reference at each load and store. When these location sets are mapped to registers or wires (e.g. output of a functional unit), the branching statements corresponding to a load are implemented using a multiplexer controlled by the pointer's value. In the case of a store, some control logic is generated to update the value of the variable the pointer points to. This control logic can be automatically generated by an architectural synthesis tool. References to array elements stored in memories or register files are treated similarly. Some control logic is also created to access the location referenced in the different memories or register files. The corresponding circuit generated after synthesis is presented in Figure 3. Note that the load (...=*p) is implemented by a 2-input multiplexer controlled by p_tag.
The removal of the dereferences '*' in loads and stores can be done in one pass. For each load (...=*p), we look at the points-to set of the pointer at this instruction. If the points-to set is only one location, the load is simply replaced by an assignment from this location. Otherwise, we create a temporary variable (star_p in Example 7) that stores the value of the data the pointer points to at the load instruction. The load instruction is then replaced by an assignment from this temporary variable. Branching statements are inserted before the load to set the value of the temporary variable star_p according to the values of the tag p_tag and the index p_index. Similarly, for each store (*q=...), we also look at the points-to set of the pointer q at this instruction. If the pointer points to only one location, the store is simply replaced by an assignment to this location. Otherwise we create a temporary variable (tmp_q in Example 7) that stores the value to be assigned to the data q points to. The store is then replaced by an assignment to this temporary variable and branching statements are inserted after the store to update the values of the variables q may point to according to the tag q_tag and index q_index.
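The pass can be pictured on a small hand-written case, assuming pointers that may reference a variable a or an element of a four-entry int array b. The tag values, the array size and the function names are hypothetical; star_p and tmp_q follow the naming used in the text, and the index is taken to be a byte offset.

```c
#include <assert.h>

enum { TAG_A = 0, TAG_B = 1 };  /* assumed tag encoding */

int a;
int b[4];

/* Load x = *p, where p may point to a or to an element of b. */
int load_via_p(unsigned p_tag, unsigned p_index) {
    int star_p;  /* temporary holding the loaded value */
    switch (p_tag) {  /* a multiplexer in the final circuit */
    case TAG_A:
        star_p = a;
        break;
    default:
        assert(p_index / sizeof(int) < 4);
        star_p = b[p_index / sizeof(int)];
        break;
    }
    return star_p;  /* x = star_p replaces x = *p */
}

/* Store *q = value, where q may point to a or to an element of b. */
void store_via_q(unsigned q_tag, unsigned q_index, int value) {
    int tmp_q = value;  /* tmp_q = ... replaces *q = ... */
    switch (q_tag) {    /* branching inserted after the store */
    case TAG_A:
        a = tmp_q;
        break;
    default:
        assert(q_index / sizeof(int) < 4);
        b[q_index / sizeof(int)] = tmp_q;
        break;
    }
}
```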
This implementation can be generalized to pointers to pointers and pointers to functions. In Section 5, we also present some optimizations to reduce the memory usage before loads and between loads and stores when the pointer is a variable.
Generalization to other types of pointers
In general, pointers may also point to other pointers and functions. The technique presented in the previous section can be extended to these types of pointers.
Pointer to pointers
Pointers to pointers can be implemented by resolving the pointers level by level. 
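One possible reading of this level-by-level resolution is sketched below for a load x = **pp. It assumes that pp may point to the pointer variables p or q, and that p and q may in turn point to a, b or c; all names and tag values are hypothetical and chosen only for illustration.

```c
#include <assert.h>

enum { TAG_P = 0, TAG_Q = 1 };             /* targets of pp (assumed) */
enum { TAG_A = 0, TAG_B = 1, TAG_C = 2 };  /* targets of p and q (assumed) */

int a, b, c;
unsigned p_tag, q_tag;  /* encoded values of the int* variables p and q */

/* Load x = **pp, resolved one level at a time. */
int load_star_star(unsigned pp_tag) {
    unsigned star_pp_tag;  /* level 1: *pp is itself an encoded pointer */
    int star_star_pp;      /* level 2: the data that pointer references */

    assert(pp_tag == TAG_P || pp_tag == TAG_Q);
    star_pp_tag = (pp_tag == TAG_P) ? p_tag : q_tag;

    switch (star_pp_tag) {
    case TAG_A:  star_star_pp = a; break;
    case TAG_B:  star_star_pp = b; break;
    default:     star_star_pp = c; break;
    }
    return star_star_pp;
}
```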
Pointer to functions
Pointers to functions are resolved in a straightforward manner after pointer analysis. The synthesis of the functions themselves is then performed according to the synthesis tool (e.g. map to component, inline, etc.). In our implementation, functions are inlined before synthesis.
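As a sketch of what this resolution might look like, an indirect call through a pointer fp that may reference two functions can be replaced by a branch on an encoded tag. The functions inc and dbl and the tag values are hypothetical.

```c
#include <assert.h>

int inc(int x) { return x + 1; }
int dbl(int x) { return 2 * x; }

enum { TAG_INC = 0, TAG_DBL = 1 };  /* assumed encoding of fp */

/* Replaces y = (*fp)(x), where fp may point to inc or dbl. */
int call_via_tag(unsigned fp_tag, int x) {
    assert(fp_tag == TAG_INC || fp_tag == TAG_DBL);
    switch (fp_tag) {
    case TAG_INC: return inc(x);  /* direct call, can now be inlined */
    default:      return dbl(x);
    }
}
```

Once every call is direct, the inlining step mentioned above can proceed as for ordinary function calls.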
OPTIMIZATION OF LOADS AND STORES
In the previous section, we have seen how pointers can be removed after pointer analysis. Now, we optimize the code for hardware synthesis. First, we present techniques to reduce the amount of storage necessary before loads (...=*p) and stores (*p=...) when the pointer p is a variable.
In this section, the following assumptions are made. The pointer p is a variable. Its points-to set consists of a set of variables (mapped to registers or wires). The optimizations presented here are only performed when the previous assumptions hold. Their generalization to loads and stores from pointers within an array or pointers pointing to array elements is beyond the scope of this paper.
The goal of the optimizations presented here is to reduce the number of live variables 1 before loads and stores. When variables are stored in registers, the number of registers used in a given program corresponds to the maximum number of variables live at a clock boundary. The direct effect of our optimizations is therefore to reduce the number of registers used in the design. Besides, synthesis tools may also take advantage of having fewer live variables before loads and stores to improve performance by reusing registers more efficiently.
Optimization of Loads
By definition, a load may read any variable of the points-to set. It also uses the value of the pointer to select which variable is actually read. This implies that all variables of the points-to set and the pointer variable are live before the load. However, only one variable is really necessary: the variable the pointer points to. The issue is then to define star_p in such a way that the number of live variables is reduced. In our implementation, each load is replaced by assignments from star_p. The variable star_p itself is defined each time p or any variable in the points-to set is modified. Dead-code elimination [1, 34] is then performed to remove all unnecessary definitions of star_p.
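The idea can be sketched on a pointer p that may point to a or b. The three functions below are hypothetical and hand-written: the first is the reference behavior with a real pointer, the second resolves the load at the load site, and the third defines star_p whenever p or a variable of its points-to set changes.

```c
#include <assert.h>

enum { TAG_A = 0, TAG_B = 1 };  /* assumed tag encoding */

/* Reference behavior, written with a real pointer. */
int ref_load(int a, int b, int cond, int new_a) {
    int *p = cond ? &a : &b;
    a = new_a;  /* a variable of the points-to set is modified */
    return *p;  /* load */
}

/* Load resolved at the load site: a, b and p_tag are live before it. */
int late_load(int a, int b, int cond, int new_a) {
    unsigned p_tag = cond ? TAG_A : TAG_B;
    a = new_a;
    return (p_tag == TAG_A) ? a : b;
}

/* star_p defined when p is assigned and again when a changes. */
int early_load(int a, int b, int cond, int new_a) {
    unsigned p_tag = cond ? TAG_A : TAG_B;
    int star_p = cond ? a : b;  /* b is not read after this point */
    a = new_a;
    if (p_tag == TAG_A) star_p = a;  /* redefinition because a changed */
    return star_p;  /* the load is now a plain assignment */
}
```

In early_load the variable b is dead as soon as star_p is first defined, whereas in late_load it stays live until the load.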
However, the early definition of star_p may also increase the number of live variables. When all variables of the points-to set are live, star_p is just a copy of one of these variables and therefore is not necessary. So, in
1. A variable is live at a particular point in a program if there is a path to the exit along which its value may be used before it is redefined (i.e. killed). It is dead if there is no such path [1, 34].
Optimization of Stores
In this section, we try to apply the same idea of creating temporary variables to reduce the number of live variables before stores.
Example 12. Let p be a pointer that may point to a, b, or
As we have seen in Example 12, the number of live variables before a store can be reduced by at most one.
The reason is that the store needs all variables of the points-to set (that are live after the store) except the variable p points to. For this purpose, given a pointer p and the size of its points-to set pts_size, we define the following class of variables:
_starN_p, for N in {1, 2, ..., (pts_size-1)} ("_starN_p" stands for "not star_p"): variables whose values are equal to the values of the variables in the points-to set that p does not point to.
Note that each _starN_p can be defined in such a way that it may only store the value of either variable of a fixed pair, as shown in Example 13. To optimize the number of live variables before stores, let us first consider an adaptation of the algorithm described in Section 5.1. Indeed, one could imagine an algorithm where the _starN_p variables are used at each store and defined when p or any variable of the points-to set is modified. Since each _starN_p variable can only store the value of one of two variables of the points-to set, they should be killed each time one of the variables of the points-to set is live. For hardware synthesis, this creates a lot of logic to control their value, which turns out not to be very practical.
In our implementation, we take a conservative approach by optimizing stores only in the case of a load followed by a store. Such a case happens after inlining functions in which the parameters passed by reference are both read and written within the function.
2) List the loads post-dominated by stores from the same pointer (implemented as a backward data-flow analysis [1, 34]).
3) Do live variable analysis assuming that each store in the list generated at Step 1 kills all variables in the points-to set.
4) If, for all loads in the list generated at Step 2, none of the variables in the points-to set are live:
   - define star_p and the _starN_p variables before the loads and when p, or any variable of the points-to set, changes between loads and stores;
   - use star_p and the _starN_p variables to update the values of variables in the points-to set after the stores.
Even though this optimization reduces the number of live variables before stores by at most one, it helps reduce the number of registers. There is, however, a trade-off between the number of registers used and the amount of steering logic. This optimization can be performed while optimizing the loads, as we will see in Section 7.
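The load-followed-by-store case can be sketched in C as follows. This is an illustration only, assuming a points-to set {a, b, c} and a statement of the form *p = *p + inc; the tag values, the struct vars type and the function name are hypothetical. The temporaries _star1_p and _star2_p follow the fixed-pair rule: _star1_p only ever holds a or b, and _star2_p only ever holds b or c.

```c
#include <assert.h>

enum { TAG_A, TAG_B, TAG_C };            /* hypothetical tag values for p */

struct vars { int a, b, c; };            /* the points-to set of p */

/* Models "*p = *p + inc": a load followed by a store through p. */
struct vars load_then_store(struct vars v, int p_tag, int inc)
{
    assert(p_tag == TAG_A || p_tag == TAG_B || p_tag == TAG_C);

    /* Before the load: star_p holds the variable p points to,
       _star1_p and _star2_p hold the two variables p does not point to. */
    int star_p   = (p_tag == TAG_A) ? v.a : (p_tag == TAG_B) ? v.b : v.c;
    int _star1_p = (p_tag == TAG_A) ? v.b : v.a;   /* a or b */
    int _star2_p = (p_tag == TAG_C) ? v.b : v.c;   /* b or c */

    /* Between load and store only _star1_p and _star2_p (plus the tag)
       need to be kept instead of a, b and c: one value fewer. */
    int stored = star_p + inc;

    /* After the store: rebuild a, b and c from the temporaries. */
    struct vars out;
    out.a = (p_tag == TAG_A) ? stored : _star1_p;
    out.b = (p_tag == TAG_B) ? stored
          : (p_tag == TAG_A) ? _star1_p : _star2_p;
    out.c = (p_tag == TAG_C) ? stored : _star2_p;
    return out;
}
```

The multiplexers that rebuild a, b and c after the store are the extra steering logic that is traded against the saved register.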
ENCODING OF POINTERS
In software, the pointers' values represent addresses in memory. These values are used in loads and stores; they have a fixed size and can be assigned (p=q) or compared (p==q). In hardware, we want to reduce the size of the storage and the complexity of the decoding logic in loads and stores. In Section 4.1, we have seen that the encoding of a pointer consists of two fields, a tag and an index. In this section, we try to encode the tag part more efficiently. Other techniques similar to the encoding of memory addresses [4, 36] could be used to encode the index part; they are not addressed in this paper.
Definition 3. We define the size of a pointer as the bit-width of its tag.
When the size of the pointer is decreased, the number of bit registers used to store its value is also reduced. The decoding logic for loads and stores is also simplified. We have seen that a load can be implemented as a multiplexer controlled by the pointer's value (tag part). Reducing the pointer's size also reduces the complexity of the decoding logic for this multiplexer. However, as we have seen in Example 6, when pointers are assigned or compared, we may have to add case statements to "translate" the values of the pointers by means of some combinational circuit. We can use encoding techniques to minimize the size of these circuits. Our goal is twofold: 1) we want to encode each pointer with the minimum number of bits in order to minimize the storage as well as the decoding logic for loads and stores; 2) we want to minimize the logic related to assignment and comparison of pointers.
We will first present the problem of pointers' encoding. The exact solution to this problem leads to what we call a local encoding in which two pointers that point to the same location set may have different encodings. This problem is, however, hard to solve and a heuristic is then introduced in which two pointers that point to the same location set share the same encoding. This gives a global encoding of the pointers' value. In order to get closer to the exact solution, corresponding to the local encoding, two optimizations are then presented called splitting and folding. These optimizations can be seen as adding "locality" to the global encoding.
Definition of the Problem
In this section, we present the problem of encoding the value of the pointers. Our first goal is to minimize the size of the pointers. Then, when a pointer is assigned or compared to another pointer, we want the corresponding tags to be equal (e.g. p_tag=q_tag) or "as close as possible" to each other. If two tags have different bitwidth, one tag can be equal to a subfield of the other. Assignments would then be performed by concatenating or removing bits, whereas comparisons would only be executed on subfields of the two codes. This reduces the size of the circuit that translates or compares the tags while keeping the number of bits to a minimum.
Definition 4. For two pointers $p_i$ and $p_j$, the pointer-dependence relation is 1 if and only if the two pointers are assigned or compared (otherwise it is 0).
Definition 5. The pointer-dependence graph is an undirected graph in which the nodes are the pointers and the edges are the relations between the pointers. An edge between two nodes is defined when the two corresponding pointers are assigned or compared.
Example 15. Consider the following code segment:
    int *r1, *r2, *r3, *q1, *q2;
    ...
    if (i == 0) { r1 = &a; r2 = &b; r3 = &c; }
    else        { r1 = &b; r2 = &c; r3 = &d; }
    if (j == 0) { q1 = r1; q2 = r2; }
    else        { q1 = r2; q2 = r3; }
    ...
In this example, we consider the pointers {r1,
The encoding problem can be stated as follows. For each pointer we represent its points-to set as a set of symbols corresponding to the location sets the pointer may point to. Thus, we have an ensemble of sets of symbols and the dependencies among the sets represented by the pointer-dependence graph. The problem consists of encoding the symbols in the sets. There are two constraints on the encoding: 1) the supercube (see footnote 1) of the codes of the symbols in each set must have minimum size; 2) the symbols that correspond to the same location set in two dependent sets must be encoded as close as possible. The reasons for the first constraint are to minimize the number of bits to store and to reduce the decoding logic for loads and stores. The reason for the second one is to reduce the size of the combinational circuit implementing pointers' assignments and comparisons.
Figure 8a shows an example of a non-optimal encoding. The encoding technique used here is a straightforward minimum-length encoding in which the value 0 is assigned to the first variable in the points-to set, 1 is assigned to the second variable of the points-to set, etc. This encoding is not optimal: some logic has to be added in the circuit to implement the assignments q2=r3 and q1=r2, as shown on Figure 8a.
To find an optimal encoding, we look at the dependence between the pointers. Pointer q1 may take the value of r1 or r2, so we want the codes of r1 and r2 to be subfields of the code of q1. Similarly, q2 may take the value of r2 or r3, so we want the codes of r2 and r3 to be subfields of the code of q2. An optimal encoding verifying these properties is shown on Figure 8b. For r1, value 0 is assigned to a and value 1 to b. For r2, 0 is assigned to b and 1 to c. As a result, q1=r1 will be replaced by q1_tag={0,r1_tag} and q1=r2 will be replaced by q1_tag={r2_tag,1} (where {,} is the concatenation operator).
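The two concatenations can be checked with a short C sketch. The tag values for r1 and r2 are the ones stated above; the 2-bit codes of q1 (a=00, b=01, c=11) follow from the two concatenations. The function names are hypothetical, and the bit operations stand in for the wiring that a synthesis tool would generate.

```c
#include <assert.h>

/* q1 = r1  ->  q1_tag = {0, r1_tag}: prepend a constant 0 bit. */
unsigned q1_from_r1(unsigned r1_tag) { return (0u << 1) | (r1_tag & 1u); }

/* q1 = r2  ->  q1_tag = {r2_tag, 1}: append a constant 1 bit. */
unsigned q1_from_r2(unsigned r2_tag) { return ((r2_tag & 1u) << 1) | 1u; }

/* Loads through each pointer, written as multiplexers on the tag.
   r1: a=0, b=1.  r2: b=0, c=1.  q1: a=00, b=01, c=11. */
int load_r1(unsigned tag, int a, int b) { return tag ? b : a; }
int load_r2(unsigned tag, int b, int c) { return tag ? c : b; }

int load_q1(unsigned tag, int a, int b, int c)
{
    assert(tag != 2u);                  /* code 10 is never produced for q1 */
    return (tag == 0u) ? a : (tag == 1u) ? b : c;
}
```

Since both assignments only add a constant bit, no translation logic is needed, and b receives the same code (01) whether q1 was assigned from r1 or from r2.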
Problem Formulation
Let's consider P pointers $p_1, p_2, \dots, p_P$. For each pointer $p_i$, let $S_i$ be its points-to set. The points-to set $S_i$ is a set of symbols $s_{ij}$, where each symbol is associated with a location set. We define $E_i$ as the set of the encoded symbols of the points-to set $S_i$. The encoded values of the symbols in each set are noted $e_{ij}$.
Definition 6. Two sets $S_i$ and $S_j$ are said to be dependent if their associated pointers are dependent (Definition 4).
1.The supercube of a set of cubes is the smallest cube containing all the cubes in the set [10] .
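Our first goal is to minimize the number of bit registers as well as the size of the decoders required to store and decode the pointers' values. We want to minimize the dimension of the supercube of the encoded symbols in each set. This minimum is achieved when the sum of the dimensions of the supercubes is also minimized:
$$\min \sum_{i=1}^{P} \dim\big(\mathrm{supercube}(E_i)\big) \quad (1)$$
where $E_i$ denotes the set of encoded symbols of the i-th points-to set and $P$ is the number of pointers.
Example 17. In the encoding presented in both Figures 8a and 8b, $\sum_{i} \dim(\mathrm{supercube}(E_i))$ = 1+1+1+2+2 = 7 is minimum.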
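The cost of Eq. 1 can be evaluated with a few lines of C. The sketch below relies on the standard fact that the dimension of the supercube of a set of binary codes equals the number of bit positions in which the codes do not all agree; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Dimension of the supercube of a set of binary codes: the number of
   bit positions in which at least two codes differ. */
int supercube_dim(const unsigned *codes, size_t count)
{
    assert(count > 0);
    unsigned differing = 0;
    for (size_t i = 1; i < count; i++)
        differing |= codes[0] ^ codes[i];   /* bits where code i differs */
    int dim = 0;
    for (; differing != 0; differing >>= 1)
        dim += (int)(differing & 1u);       /* count the differing bits */
    return dim;
}
```

For instance, a 1-bit points-to set encoded as {0, 1} has dimension 1 and a three-symbol set encoded as {00, 01, 11} has dimension 2, which gives the terms 1 and 2 summed in Example 17.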
When two pointers are assigned or compared, we also want to minimize the size of the circuit implementing the translation of the codes. For this purpose, the distance between encoded symbols in two dependent sets has to be minimum:
$$\min \sum_{(S_i, S_j)\ \mathrm{dependent}} d(E_i, E_j) \quad (2)$$
where $d(E_i, E_j)$ is the distance between the two encoded sets. When the pointers have the same points-to set and the encoding has the same length n, $d(E_i, E_j)$ is defined as:
$$d(E_i, E_j) = \min_{\pi} \sum_{k=1}^{N} H\big(e_{ik}, \pi(e_{jk})\big) \quad (3)$$
where N = $|S_i|$ = $|S_j|$ is the number of symbols in the points-to sets, $\pi$ is in the set of the permutation functions of n bits, and $H$ is the Hamming distance. Note that the two equal points-to sets may have different encodings.
In general, the points-to sets may differ and their encodings may have different lengths. The computation of the distance is then more complex. For example, the distance between two sets whose encodings have different lengths can be computed by padding the shorter codes with 0s or 1s. Then, if the points-to sets $S_i$ and $S_j$ differ, we are only interested in the distance between the encodings of the symbols common to the two points-to sets.
Our goal is to minimize Eq. 1 and Eq. 2. There is a trade-off between the storage area (number of registers) and the amount of logic used to translate the codes. For example, one may optimize the size of the pointers keeping the amount of logic minimum by minimizing first Eq. 2 and then Eq. 1. In general, we can cast the problem as follows:
$$\min \Big[ \alpha \sum_{i=1}^{P} \dim\big(\mathrm{supercube}(E_i)\big) + (1-\alpha) \sum_{(S_i, S_j)\ \mathrm{dependent}} d(E_i, E_j) \Big] \quad (4)$$
where $\alpha$ is a coefficient between 0 and 1.
Since this problem is computationally hard to solve, we use heuristics.
Simplified problem
6.3.1 Formalism for a Global Solution
In the general formulation of the problem presented in Section 6.2, different codes may be associated with the symbols in each set. Therefore the encoding has to be found locally, for each set. The problem can be simplified by constraining all symbols associated with the same location set to share the same code. The encoding is then found globally for all the symbols that correspond to the same location set in the points-to sets. The final encoding values of the pointers are then found by picking the relevant bits (i.e. the bits that are not identical for the different encodings of the symbols in the points-to set).
... and 0 for r2.
Figure 8b gives an example of a better global encoding. The encoding is global because the pointers initially share the same encoding shown in Figure 9. For a global encoding, minimizing Eq. 2 is then irrelevant because the distance between the codes of the symbols that correspond to the same location set in the different points-to sets is null (i.e. the distance between any two dependent encoded sets is 0). The complexity of the logic to perform assignments and, to some extent, comparisons is then minimal. However, the size of the pointers may vary and affect the size of the decoding circuit in loads and stores. Our goal becomes to minimize Eq. 1 only.
For this simplified problem, it is convenient to consider the symbols (i.e. location sets) in the union of the points-to sets. These symbols will be denoted $s_1, s_2, \dots, s_N$. The size of the problem is reduced: instead of dealing with O(P*N) symbols $s_{ij}$ we only deal with N symbols $s_j$, where N is the number of location sets. We now use a formalism that has been used to solve other encoding problems [11, 48].
Definition 7. The relation matrix A is defined as the matrix in which the rows represent the points-to sets and the columns the symbols. Entry $a_{ij}$ of A is 1 if the symbol $s_j$ belongs to the points-to set of pointer $p_i$, and 0 otherwise.
For example, the first row of the matrix shows that r1 may point to a or b.
We search for an encoding matrix E. Namely, each row in A corresponds to a points-to set. ... to the constraint expressed in Eq. 1. This problem corresponds to the input encoding problem [11, 10, 48] if the 0s in matrix A are replaced by don't cares (i.e. *). In other words, our problem is a simpler instance of the general input encoding problem.
Global Encoding Algorithm
The problem of input encoding has been extensively studied [3, 15, 11, 35, 38, 39, 40, 48] . We use an approach reminiscent of MUSTANG [35] and POW3 [3] .
Definition 8. An affinity graph is an undirected weighted graph in which the nodes are the symbols and the edges are the relations between the symbols represented by the relation matrix A. The weight on the edge {$s_i$, $s_j$} is defined as:
$$w_{ij} = \sum_{k=1}^{P} a_{ki}\, a_{kj}\, \big(N - |S_k|\big) \quad (5)$$
where $P$ is the number of pointers, $N$ is the total number of symbols, $|S_k|$ the number of symbols in the set $S_k$, and $a_{ki}$ is an element of the relation matrix.
The weight in the affinity graph increases with the number of sets that contain both $s_i$ and $s_j$: when two location sets are in many points-to sets, we want their codes to be close. This is even more important for small points-to sets. For example, if we have 2 symbols in the points-to set $S_k$, their codes must be next to each other to minimize the dimension of the supercube of the encoded set $E_k$, whereas if we have $N$ symbols in the points-to set $S_k$, the Hamming distance between the encodings of the symbols in the points-to set can be as much as $\lceil \log_2 N \rceil$. Therefore, the weight $w_{ij}$ is the sum of the contributions of the points-to sets that contain both $s_i$ and $s_j$, where the contribution of each points-to set $S_k$ is $N - |S_k|$.
The pointer encoding problem can be solved as an embedding of the affinity graph in the Boolean hypercube as done in [3, 21, 35, 38].
Example 20. The relation matrix presented in Example 19 (cf. Figure 10a) can be used to generate the affinity graph of Figure 10b.
Let's look at the weight on the edge {a,b}. The variables a and b are both in the points-to sets of r1 and q1. The weight is 3, sum of 2, contribution from r1, and 1, contribution from q1.
After graph-embedding, the encoding presented in Figure 11 
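The edge weights of the affinity graph can be recomputed with the C sketch below. It assumes that each points-to set containing both symbols contributes the total number of symbols minus the size of that set; this form is inferred from the weight 3 = 2 + 1 quoted above for the edge {a,b} and should be read as an assumption. The relation matrix encodes the points-to sets that follow from the code of Example 15 (r1={a,b}, r2={b,c}, r3={c,d}, q1={a,b,c}, q2={b,c,d}); all identifiers are hypothetical.

```c
#include <assert.h>

#define N_POINTERS 5   /* rows:    r1, r2, r3, q1, q2 */
#define N_SYMBOLS  4   /* columns: a,  b,  c,  d      */

/* Relation matrix: entry [k][i] is 1 if symbol i is in points-to set k. */
static const int REL[N_POINTERS][N_SYMBOLS] = {
    {1, 1, 0, 0},   /* r1 = {a, b}    */
    {0, 1, 1, 0},   /* r2 = {b, c}    */
    {0, 0, 1, 1},   /* r3 = {c, d}    */
    {1, 1, 1, 0},   /* q1 = {a, b, c} */
    {0, 1, 1, 1},   /* q2 = {b, c, d} */
};

/* Assumed weight of edge {i, j}: sum, over the points-to sets that contain
   both symbols, of (N_SYMBOLS - size of the set). */
int affinity_weight(int i, int j)
{
    assert(i >= 0 && i < N_SYMBOLS && j >= 0 && j < N_SYMBOLS && i != j);
    int weight = 0;
    for (int k = 0; k < N_POINTERS; k++) {
        int size = 0;
        for (int s = 0; s < N_SYMBOLS; s++)
            size += REL[k][s];
        if (REL[k][i] && REL[k][j])
            weight += N_SYMBOLS - size;
    }
    return weight;
}
```

With symbols numbered a=0, b=1, c=2, d=3, the call affinity_weight(0, 1) reproduces the contribution of 2 from r1 and 1 from q1.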
Encoding with folding
In the local encoding problem, two symbols can share the same code. The rationale for this proposition is that we want to distinguish each symbol inside a points-to set and, in the case of a comparison, we want to distinguish the symbols in the two dependent points-to sets.
In the relation matrix A, folding the symbols $s_i$ and $s_j$ is equivalent to replacing columns i and j by one column k such that:
$$a_{lk} = a_{li} \lor a_{lj} \quad (6)$$
for l in {1, 2,..., N}.
In the affinity graph, folding is done by merging (or fusing, see footnote 1) the nodes corresponding to the symbols $s_i$, $s_j$ into one new node corresponding to $s_k$. The weights on the edges incident to this new node corresponding to $s_k$ are then defined as:
Graph-embedding techniques can be modified to incorporate folding. In Section 6.6, we present a column-based encoding algorithm with folding.
Example 21. Let's consider the pointer-dependence graph on Figure 12, where r1, r2, and r3 point respectively to {a,b,c}, {b,c,d} and {c,d,e}. The relation matrix and the associated affinity graph are represented in Figure 13. The symbol a is in the points-to set of r1 and q1, whereas the symbol e is in the points-to set of r3 and q2. According to the pointer-dependence graph, these points-to sets are not dependent. The symbols associated with a and e can be folded. After folding we end up with the graph on Figure 14.
This leads to an encoding that requires only two bits:
1. A pair of vertices a, b in a graph are said to be fused (merged or identified) if the two vertices are replaced by a single vertex such that every edge that was incident on either a or b or on both is incident on the new vertex [13].
Encoding with splitting
In the local encoding problem, one symbol can also have different codes in the different points-to sets. 
Definition 10. We define as splitting the action of assigning two or more codes to one symbol (or location set).
In Section 6.3 and 6.4, each location set was associated with a unique symbol that was encoded. After splitting, one location set may be associated with more than one symbol: splitting a symbol $s_i$ is equivalent to creating a new symbol $s_i'$ which corresponds to the same location set. The original symbol $s_i$ and the newly created $s_i'$ are then encoded into $e_i$ and $e_i'$ respectively.
Consider the example of Figure 17, where r1, r2 and r3 may respectively point to {a,b}, {b,c}, and {a,c}. The relation matrix and the corresponding affinity graph are presented in Figure 17.
We would like to encode r1, r2, and r3 with 1 bit and q with 2 bits. We also want the codes of r1, r2, and r3 to be subfields of the code of q. Using the encoding technique without splitting symbols, we can find the encoding on Figure 18. After splitting the symbol a, we end up with the two symbols a and a'. The new encoding problem is presented on Figure 19. We can find the encoding on Figure 20 where the symbol a is in the points-to set of r1 and q, and a' in the points-to set of r3 and q.
The encoding in Figure 20 is optimal: r1, r2, and r3 are encoded on 1 bit and the assignments to q (q=r1, q=r2, q=r3) don't require any additional logic.
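One encoding with these properties can be written out in C. The concrete codes below (a=00, b=01, c=11, a'=10 for q) are an assumption chosen for illustration; the codes actually shown in Figure 20 may differ. The identifiers are hypothetical as well.

```c
#include <assert.h>

enum loc { LOC_A, LOC_B, LOC_C };

/* Assumed 2-bit codes for q after splitting a into a and a':
   a=00, b=01, c=11, a'=10. Both a and a' denote the same location. */
enum loc decode_q(unsigned q_tag)
{
    switch (q_tag & 3u) {
    case 0u: return LOC_A;   /* a  */
    case 1u: return LOC_B;   /* b  */
    case 3u: return LOC_C;   /* c  */
    default: return LOC_A;   /* a' (code 10) */
    }
}

/* 1-bit tags: r1: a=0, b=1.  r2: b=0, c=1.  r3: a'=0, c=1. */
enum loc decode_r1(unsigned t) { return t ? LOC_B : LOC_A; }
enum loc decode_r2(unsigned t) { return t ? LOC_C : LOC_B; }
enum loc decode_r3(unsigned t) { return t ? LOC_C : LOC_A; }

/* Each assignment to q only adds a constant bit (pure concatenation). */
unsigned q_from_r1(unsigned t) { return t & 1u; }                /* {0, r1_tag} */
unsigned q_from_r2(unsigned t) { return ((t & 1u) << 1) | 1u; }  /* {r2_tag, 1} */
unsigned q_from_r3(unsigned t) { return 2u | (t & 1u); }         /* {1, r3_tag} */
```

Without the second code for a, the three pairs {a,b}, {b,c} and {a,c} would all have to be adjacent in the hypercube, which is impossible because a cycle of length three cannot be embedded in it.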
As described in Section 6.2 the symbols in each set can have different codes. Therefore, to minimize the dimension of the supercube of the encoded symbols in a points-to set (i.e. Eq. 1), we can create new symbols associated with the same location sets for this points-to set. Note that, if we split the symbols for each points-to set, we end up with a local encoding scheme close to the one presented in Section 6.2. The only difference is that one symbol may have multiple encodings within the same points-to set. However, to limit the increase in complexity, we are trying to split as few symbols as possible and only when useful to reduce the cost function.
When a symbol $s_j$ is split, a new symbol $s_j'$ is created. For each points-to set $S_i$ such that $s_j \in S_i$, we decide whether the new points-to set contains $s_j$, $s_j'$, or both $s_j$ and $s_j'$. The new set of encoded symbols can be defined as:
$$E_i^{new} = \big(E_i \setminus \{e_j\}\big) \cup E' \quad (8)$$
where $E'$ is either {$e_j$}, {$e_j'$} or {$e_j$, $e_j'$}.
In order to minimize Eq. 1, for every set $S_i$ that may contain $s_j$ or $s_j'$, we want to minimize
$$\dim\big(\mathrm{supercube}(E_i^{new})\big) \quad (9)$$
which corresponds to
$$\min_{E'} \dim\Big(\mathrm{supercube}\big((E_i \setminus \{e_j\}) \cup E'\big)\Big) \quad (10)$$
where $E'$ is either {$e_j$}, {$e_j'$} or {$e_j$, $e_j'$}.
In the relation matrix A, splitting is done by adding a column relative to $s_j'$. For each row corresponding to a points-to set $S_i$ such that $s_j \in S_i$, the pair of entries ($a_{ij}$, $a_{ij'}$) is set to (0,1), (1,0) or (1,1).
In Figure 19, for the points-to set of r3, the entry for a is set to 0 and the entry for a' is set to 1. For the points-to set of q, Eq. 9 is minimum (equal to 2) when E' is either {a}, {a'} or {a,a'}. E'={a,a'} is then selected and the new points-to set of q contains both a and a'. Consequently, the entries for a and a' are both set to 1. Since a is in the new points-to set of r1 and a' in the new points-to set of r3, this allows both q=r3 and q=r1 to be implemented trivially.
Encoding Algorithm
We propose a column-based approach such that the encoding matrix can be found column by column [14, 11, 10] . Our algorithm without folding and splitting is similar to the one used in POW3 [3] . The pseudo code of the algorithm with folding and splitting is presented on Figure 21 .
The algorithm encodes the pointers with n bits. We consider one bit of the code at a time. For a symbol $s_i$ associated with the code $e_i$, we consider the bits $e_i^k$ for k={1, 2,..., n}. At each iteration k, we construct the k-th column of the encoding matrix E by assigning bit $e_i^k$ to all symbols $s_i$ for i={1, 2,..., N}. We ultimately want to distinguish all symbols. Therefore, in our algorithm, we have to make sure that at each iteration k we have no more than $2^{n-k}$ symbols associated with the same code. For example for k=(n-1), we cannot have more than two symbols with the same code.
Definition 11. There is a class violation at iteration k when more than $2^{n-k}$ symbols have the same code so far.
Figure 21: Graph embedding algorithm with splitting and folding
Note that, at iteration k, we are only considering the k first bits of the codes, since the other ones haven't been assigned yet.
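A class-violation test can be sketched in C as follows. The bound of 2^(n-k) symbols per partial code is an assumption inferred from the statement that for k=(n-1) no more than two symbols may share a code; the function name and the representation of partial codes as integers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if, after k of the n bits have been assigned, more than
   2^(n-k) symbols share the same partial code, and 0 otherwise.
   prefix[i] holds the k bits assigned so far to symbol i. */
int class_violation(const unsigned *prefix, size_t n_symbols,
                    unsigned n, unsigned k)
{
    assert(k <= n && n - k < 8 * sizeof(size_t));
    size_t limit = (size_t)1 << (n - k);   /* codes still distinguishable */
    for (size_t i = 0; i < n_symbols; i++) {
        size_t same = 0;
        for (size_t j = 0; j < n_symbols; j++)
            if (prefix[j] == prefix[i])
                same++;                    /* symbols in the class of i */
        if (same > limit)
            return 1;
    }
    return 0;
}
```

With n=2 and k=1, four symbols with first bits 0, 0, 1, 1 can still be told apart by the second bit, whereas first bits 0, 0, 0, 1 leave three symbols to be separated by a single remaining bit, which is a class violation.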
At each iteration k, $e_i^k$ is defined for every symbol $s_i$. The assignment is done by considering the symbols on every edge end-points, starting with the edges with highest weights. The weights at each iteration are adjusted using the following formula [3]: (11) where $H(e_i, e_j)$ is the Hamming distance between the partially assigned codes of symbols $s_i$ and $s_j$.
For the symbols incident to the edge {$s_i$, $s_j$}, we try to assign the same value to both $e_i^k$ and $e_j^k$. However, this may not be possible in two cases. First, at each iteration k, the number of symbols having the same code is limited to prevent class violations (cf. Definition 11). Moreover, if the symbols $s_i$ and $s_j$ are also incident to other edges whose weights are higher than $w_{ij}$, they may already have been assigned two different values $e_i^k$ and $e_j^k$.
These two conditions are expressed in Proposition 3 below.
Definition 12. An edge {$s_i$, $s_j$} is said to be violated at iteration k if the bits $e_i^k$ and $e_j^k$ associated with the two symbols incident to the edge have different values.
Proposition 3. An edge {$s_i$, $s_j$} is violated at iteration k if either one of the following conditions applies:
• there is a class violation (and therefore $e_i^k$ and $e_j^k$ need to have different values),
• different values $e_i^k$ and $e_j^k$ have already been assigned to the two symbols.
In the case of a class violation, we try to fold one of the symbols on the edge {$s_i$, $s_j$} with any of the previously assigned symbols. At this stage, two symbols are folded if Proposition 1 holds and if they have the same partial code so far.
If the edge {$s_i$, $s_j$} is still violated (i.e. $e_i^k \neq e_j^k$), we try to split the symbols incident to the edge. One symbol can be split if the newly created symbol does not cause any class violation or can be folded with another symbol.
In our algorithm, for a symbol $s_i$, we create a new symbol $s_i'$ associated with a code $e_i'$ such that $e_i'^{\,l} = e_i^l$ for l<k and $e_i'^{\,k} \neq e_i^k$. In case of a class violation, we try to fold this new symbol. If folding cannot be done, the symbol is not split.
Example 24. Consider the problem presented on Figure 22. The associated relation matrix and affinity graph are presented on Figure 23, in which pointer q1 may take the value of r1, r2, or r3 and q2 may take the value of r3, r4, or r5. The variable d (which is now mapped to a symbol representing both d and a') can also be split and the new symbol d' can be folded with a. The final relation matrix is then:
We end up with the encoding on Figure 24 in which all constraints are satisfied.
IMPLEMENTATION
We have implemented the different algorithms using the SUIF environment [52, 66]. The toolflow is presented on Figure 25. Our implementation takes a function with pointers in C and generates a module in Verilog. This module can then be synthesized using the Behavioral Compiler of Synopsys [67]. For hardware synthesis, the timing information is expressed in the C model: wait() in C will be translated into @(posedge clk) in Verilog.
The ports and the data types are defined in a separate header file. The translation from C to Verilog consists of different passes. After the front-end, we inline the functions and perform the pointer analysis [50] . Then the points-to information is used to remove and optimize pointers in the following order:
- define the points-to set of each pointer;
- replace the loads and stores (insert star_p and tmp_p);
- optimize load 1: define star_p when p or any variable of the points-to set changes;
- optimize loads followed by stores: create the _starN_p variables;
- optimize load 2: kill star_p when all variables of the points-to set are live;
- encode pointers' value;
- dead-code elimination.
The intermediate code without pointers is then translated into Verilog using Csuif2Verilog.
We have recently ported our research to the Synopsys Cocentric SystemC Compiler [68] to synthesize C models into hardware directly, without having to translate C into HDL. In addition, we have also developed a tool to implement dynamic memory allocation in hardware [44] . 
RESULTS
We first show the results for the resolution of pointers in relatively large examples. Then we illustrate the effect of pointers' encoding and of the optimization of loads and stores on selected examples.
Since there are no synthesis benchmarks written in C with pointers, the objective of this section is to show the technical feasibility of mapping C descriptions to logic gates. In order to test our tool on real examples, we present the implementation of two algorithms, a two-dimensional inverse discrete cosine transform (2D IDCT) [31] and an alpha blender, written in C. The 2D IDCT is widely used in image compression standards such as JPEG, MPEG and H263. The 2D IDCT implemented consists of two one-dimensional IDCTs (1D IDCTs). For this purpose, we use three different memories: the input buffer (in_table), the intermediate buffer that stores the result of the first 1D IDCT (buf_table) and the output buffer (out_table). These memories are accessed through pointers and pointer arithmetic. Pointers are also used in the 1D IDCT to reference two register banks (buff1 and buff2).
The 2D IDCT is implemented using only one call to 1D IDCT (function 1d_idct), which is inlined before synthesis. Note that, in this specific example, pointers are not only used to access memories but also to share resources: only one 1d_idct is synthesized. Since functions are inlined in our framework, a more standard implementation of the 2D IDCT algorithm in which the 1d_idct function is called twice would lead to two 1D IDCT blocks. Such a design would typically be larger and more difficult to synthesize efficiently. Using pointers here provides a convenient and efficient way of performing resource sharing.
The second example corresponds to an alpha blender. Alpha blenders are used in video and signal processing to superimpose multiple images. Our implementation takes three images and alpha planes of size 8x8. The alpha plane defines the degree of opacity for each pixel in the image. The order in which the images are placed with respect to each other (e.g. front, middle, back) is defined by a layer number associated with each image. The different images and alpha planes are stored in separate arrays (mapped to separate memories) in order to access them in parallel. Pointers are used to access the different arrays.
The results after synthesis are presented on Table 1. The CPU time for translating the C model into Verilog was measured on a Sun Ultra 2. The Verilog modules were synthesized with Behavioral Compiler without unrolling loops. The architecture of the IDCT is presented in Figure 26. The design consists of five multipliers, four adders and two ALUs. Other implementations can be found by changing the timing and resource constraints.
We have written several models to study the effects of the different optimizations presented in Sections 5 and 6. These optimizations consist of encoding the pointers' value and reducing the number of live variables before loads and between loads and stores.
The first set of results illustrates the effects of each feature of the optimizer. Table 2 and Table 3 show the examples with the area and cumulative timing after pointer resolution with and without optimization. In the second example (load/store), we have a pointer that may point to two integer variables stored in registers. This pointer is used as a parameter in a function call. After inlining the function, we end up with a load followed by a store. Here the optimization saves one register with a small increase in the amount of steering logic.
Finally, the last example (encoding) implements the model described in Example 15 with the two encodings presented in Example 16. Here the encoding of the pointers' values reduces the combinational logic by 40%. Since the design is simpler, the circuit is also faster.
The second set of examples compares our encoding algorithm to other encoding schemes. The results are broken down into three components of the design: the number of registers necessary to store the pointers' value (storage), the logic necessary to assign and compare pointers (assignment) and the implementation of loads and stores (load/store). Each of these components is synthesized using Synopsys Design Compiler. We present the results for five different schemes.
First we present the results for a global encoding (global) in which we associate the same code with all symbols associated to the same variable in the different points-to sets. In this case, assignments or comparisons of pointers can be performed without translating the values of the pointers. However, the number of bits used for the encoding is not minimal, which leads to larger decoding circuits (cf. both load/store and assignment) and more registers (cf. storage).
The second scheme (simple-alg) is the implementation of the heuristic algorithm presented in Section 6, without splitting and folding. The size of the pointer is then reduced but is still not always minimal. The results for the algorithm with folding and splitting (split&fold) are given. The length of the codes is then close to the minimum and the size of the combinational circuit for both assignment and load/store is reduced, which gives better results. Results for minimum-length encoding (min-length) are also given. In this suboptimal encoding (similar to the non-optimal encoding used in Example 16), each variable in each points-to set is simply associated with a number (0 for the first variable, 1 for the second variable, etc.). The number of bits used to encode each tag is then minimum but the size of the circuit which translates the values of the pointers is not. Finally, one-hot (1-hot) encoding gives larger codes; however, the specific properties of the resulting codes can be used to simplify the decoding logic, especially in loads and stores.
In this section, we have shown how C code with pointer variables can be synthesized by removing the pointers and using high-level synthesis. Moreover, variations on the implementation may be explored using the optimizations presented in Sections 5 and 6. Even though the effect of these optimizations may be limited in general, they can be used to reduce the storage area and/or the steering logic. In particular, optimization of loads and stores can be used to reduce the number of registers at the cost of an increase in the amount of steering logic. Encoding, on the other hand, can be used to reduce both the size of the pointers and the logic necessary to translate and decode the pointers' value, leading to better performance.
CONCLUSION
We have presented how C code with pointers can be efficiently mapped to hardware. With our methodology, memory is partitioned into location sets and pointer analysis is used to define where locations are accessed in the program. Pointers can then be synthesized by encoding their values and by generating circuits to dynamically access the different locations they may reference.
Our toolflow fits into current methodology and supports the mapping of data to multiple memories, registers or wires. Compiler techniques are used to reduce the storage before pointer loads and stores. Heuristics are used to efficiently encode the values of pointers by reducing their size and by optimizing the circuits implementing assignments and comparisons of pointers.
The synthesis of pointers raises the level of abstraction at the input of high-level synthesis. Models can be described at the behavioral level using the notions of a single address space and of indirect memory references found in many programming languages. The techniques and optimizations presented here can be generalized to support more of the C/C++ syntax as well as other programming languages, facilitating the mapping of functions and complex data structures including object-oriented features into hardware.
