Hybrid architectures combining conventional processors with configurable logic resources enable efficient coordination of control with datapath computation. With integration of the two components on a single device, loop control and data-dependent branching can be handled by the conventional processor, while regular datapath computation occurs on the configurable hardware. This paper describes a novel pragma-based approach to programming such hybrid devices. The NAPA C language provides pragma directives so that the programmer or an automatic partitioner can specify where data is to reside and where computation is to occur with statement-level granularity. The NAPA C compiler, targeting National Semiconductor's NAPA1000 chip, performs semantic analysis of the pragma-annotated program and co-synthesizes a conventional program executable combined with a configuration bit stream for the adaptive logic. Compiler optimizations include synthesis of hardware pipelines from pipelineable loops.
INTRODUCTION
Recently, researchers in the Adaptive Computing community have proposed hybrid architectures composed of a conventional processor tightly integrated with configurable logic [10, 6, 1]. In these hybrid systems, FPGA state is readily accessible to the main processor with single-cycle latency rather than at the other end of an I/O bus. Processor state can be communicated to the FPGA circuits equally quickly. This new capability enables application developers to rapidly shift focus between conventional processor and configurable logic, performing a given computation on the platform (conventional processor or FPGA) best suited to the particular computation. Thus control functions, typically more difficult to express and manage in hardware, can be delegated to the conventional processor, while the "inner loop" can be mapped to the FPGA without losing appreciable performance to the transfer of control information between the two components.
Unfortunately, there is a dearth of tools that target such devices. Conventional compilers can generate object files for the RISC processor, but do not synthesize hardware for the parts of the computation mapped to the FPGA. Similarly, while many tools exist to synthesize logic from schematics, VHDL, or Verilog, these tools do not address the part of the computation mapped to the RISC processor. Further, current tools do not address the implicit communication that must occur between the two components as the focus of control shifts between them.
In this paper we present a language, NAPA C, and a compiler that address the hardware-software co-synthesis problem in the context of hybrid RISC/FPGA processors. NAPA C constructs allow the programmer to explicitly map data and computations to either the RISC processor or the FPGA. The NAPA C compiler generates a conventional C program that contains the portions of the computation assigned to the RISC processor as well as C code to control the circuits generated for the FPGA. Through the MARGE datapath synthesis tool, the compiler generates, for the computation mapped to the FPGA, a Verilog netlist that utilizes highly optimized pre-placed, pre-routed macro generators. The compiler targets the NAPA1000 hybrid processor, a chip which combines a small RISC processor (the Fixed Instruction Processor, or FIP) with configurable logic (the Adaptive Logic Processor, or ALP). The NAPA C compiler supports regular datapath computation on the ALP. The compiler also recognizes "ALP functions," which serve as new CISC instructions to augment the RISC instruction set. In addition, the compiler analyzes C loops whose body is mapped to the ALP and, where possible, generates hardware pipelines for pipelineable ALP loops.
The remainder of the paper is organized as follows. The next section reviews related work. Then we briefly discuss the NAPA1000 architecture. The next section describes the NAPA C language. Next we describe the compiler structure, including the SUIF [5] compiler infrastructure and the MARGE datapath compiler, with emphasis on the loop pipeline scheduling phase. Finally we end with conclusions and future directions.
RELATED WORK
A number of research projects address compiling for hybrid conventional RISC/FPGA architectures. The RAW project at MIT [11] is studying a systolic-array-like tiled architecture whose components are hybrid RISC/FPGA processors. They are studying compiler optimizations to exploit instruction-level parallelism, and to partition and map computation to the array of tiles.
The Berkeley BRASS project [6] is designing a hybrid MIPS architecture that includes a reconfigurable coprocessor. Compiler techniques are focused on Java-based generator languages.
Oxford University researchers have developed a co-synthesis system that accepts specifications in SML and generates occam software and Handel hardware components [9]. In contrast, our work is based on a more pragmatic level in which algorithms expressed in C are partitioned between hardware and software.
Weinhardt discusses generation of pipelines for FPGA processors [12], with implementation on the EV-1 of a vector computational model for FPGA computing.
THE NAPA1000 ARCHITECTURE
The NAPA1000 (Figure 1, from [1]) integrates on a single chip a small embedded 32-bit RISC processor (the Fixed Instruction Processor, or FIP) along with configurable logic, on-chip memory, and an interconnection network interface. The Compact RISC processor is an embedded processor proprietary to National Semiconductor.
NAPA C LANGUAGE
The NAPA C language presents a tightly integrated programming model encompassing data storage and computation on both FIP and ALP. These concepts are available to the programmer with explicit new directives. The extensions provide the following capabilities:
1. indicate whether a variable will reside on the ALP as a register, in ALP local memory, or in external memory accessible both to FIP and ALP.
2. indicate the bit length of an integer variable on the ALP.
3. indicate whether a subroutine is to be compiled for the FIP or the ALP.
4. indicate whether an expression is to be computed on the FIP or the ALP.
5. indicate when computation is to occur in parallel on FIP and ALP (co-begin), and when parallel threads are to merge (join).
6. initiate I/O through the configurable I/O pins.
LANGUAGE EXTENSIONS
There are several different styles that may be used to introduce these extensions into the language. The first is to add keywords and/or overload existing syntax in C. This approach limits portability. Another approach is to provide new classes in object-oriented languages such as C++ or Java. We have chosen not to use this approach because we are using the SUIF compiler infrastructure, which currently supports C.
The alternative we have chosen is to use pragmas. Pragma directives are commonly used in the high performance computing community (for example, in HPF and OpenMP). These "#pragma" directives are parsed by the NAPA C compiler, and code is generated accordingly. Other C compilers will simply ignore the pragmas and compile the entire program to a FIP architecture. The advantage of this approach is that the program is in standard C and can be compiled with pragmas in place and debugged on the workstation with a standard C compiler. It can also be compiled with the NAPA C compiler and simulated via the FIP/ALP simulator to tune performance by adjusting the program partitioning. The NAPA simulator is a cycle-accurate combined FIP/ALP simulator. FIP operations are simulated in a simulation model of the CR32 processor. ALP operations are simulated by a logic simulator that models the behavior of core cells and routing resources. Finally, the program can be run on the NAPA1000 part when it becomes available.
NAPA C PRAGMAS
Pragmas can occur in declarations and in executable code. Pragmas associated with declarations define either the location of the variables being declared or the bit lengths of ALP register variables. Pragmas associated with executable code define where the computation is to occur (ALP or FIP). The pragma is inserted after the normal C declaration of the variables referenced. The following options tell the compiler where variables are to reside. There are five alternatives: ALP core cells; ALP scratchpad memory; ALP memory modules 1 or 2; external memory accessible to both FIP and ALP. External memory is the default alternative, which the compiler will assume in the absence of other pragmas.
1. #pragma MALP loc reg variable-name-list. This location directive means that the variables named will be allocated on ALP core cells as "ALP registers."
2. #pragma MALP loc m{0 | 1 | 2 | 3}. This location directive means that the variables will be allocated in a memory rather than on core cells. The data may be allocated in external memory accessible to both FIP and ALP (m0), in ALP local memory (m1 or m2), or in the scratchpad (m3).
3. #pragma MALP size bit-length variable-name-list. This directive sets the bit length of the named variables to bit-length.
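The declaration pragmas above might be used as in the following minimal sketch. The exact punctuation of the directives (for instance, whether the variable list is delimited) is not recoverable from the text, so the spelling below is an assumption, and the names `coeff`, `acc`, `NUM_TAPS`, and `dot_with_coeff` are hypothetical. A standard C compiler ignores the unknown pragmas (possibly with a warning) and treats `acc` as an ordinary `int`, so workstation results can differ from a true 12-bit ALP register once values exceed 12 bits.

```c
#define NUM_TAPS 8

/* Coefficients placed in ALP local memory module 1 (assumed directive spelling). */
int coeff[NUM_TAPS];
#pragma MALP loc m1 coeff

/* Accumulator allocated on ALP core cells as a 12-bit ALP register
 * (assumed directive spelling). */
int acc;
#pragma MALP loc reg acc
#pragma MALP size 12 acc

/* Multiply-accumulate over the coefficient table. With a standard C
 * compiler the pragmas are ignored and this is plain sequential C. */
int dot_with_coeff(const int *x)
{
    acc = 0;
    for (int i = 0; i < NUM_TAPS; i++) {
        acc += coeff[i] * x[i];
    }
    return acc;
}
```

Because the pragmas follow the declarations they annotate, removing them leaves a legal C program, which is the portability property the pragma approach is meant to preserve.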
EXECUTION PRAGMAS
In addition to allocating variables to FIP or ALP memory, the programmer may indicate on which computational engine the computation is to occur and how parameters are to be passed to ALP subroutines. To specify that an entire subroutine is to be executed on the ALP and that parameters to and from the subroutine are to be passed through the internal bus, the following pragma must be inserted following the function declaration:

    #pragma MALP function function-name size result-size

The compiler synthesizes a circuit for the function body, with parameters copied from the internal bus to ALP registers. Upon function exit, the return value is written to the bus. In the FIP program, code is generated to copy the return value from the bus into FIP memory. Figure 3 shows an example of using an ALP function from within a FIP loop. The function is called repeatedly until a termination condition is met. The example illustrates the fine-grained alternation of focus of control that is possible in this hybrid architecture. Logic_Op is effectively a new CISC instruction that augments the RISC instruction set. The low-order four bits of a, b, c, and d are passed on the FIP-ALP communication bus, and the return value is passed back on the same bus.
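Since Figure 3 is not reproduced here, the following is a hypothetical stand-in for it, not the authors' example: the Boolean expression inside `logic_op`, the shift of `b`, the bound of eight calls, and the name `calls_until_zero` are all assumptions chosen only to illustrate a 4-bit ALP function invoked repeatedly from a FIP loop. The directive spelling is likewise assumed.

```c
/* Hypothetical ALP function: only the low-order four bits of each
 * argument are used, and the result is four bits wide. */
int logic_op(int a, int b, int c, int d)
{
    a &= 0xF; b &= 0xF; c &= 0xF; d &= 0xF;
    return ((a & b) ^ (c & d)) & 0xF;
}
#pragma MALP function logic_op size 4

/* FIP-side loop: call the ALP function until the result is zero
 * (the termination condition), or until eight calls have been made. */
int calls_until_zero(int a, int b, int c, int d)
{
    int calls = 0;
    do {
        a = logic_op(a, b, c, d);  /* focus of control shifts to the ALP and back */
        b >>= 1;
        calls++;
    } while (a != 0 && calls < 8);
    return calls;
}
```

The loop control and the data-dependent exit test stay on the FIP, while each call to `logic_op` behaves like a single custom instruction.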
ALP BLOCK
The following directive brackets statements that are to be performed on the ALP. The directive indicates that the enclosed statements are to be synthesized as one or more ALP instructions, where an instruction may execute over multiple clock cycles.

    alp_begin;
    /* ALP-resident hardware structures will be synthesized from the C code here */
    alp_end;

When the ALP function or ALP block is called, the FIP program initiates the ALP circuit, and then waits for the ALP to signal completion. The following section describes how the FIP and ALP may be directed to operate concurrently.
INTRINSIC FUNCTIONS FOR CONCURRENCY
C is inherently a sequential language. The NAPA C compiler analyzes an ALP computation for fine-grained parallelism, and schedules hardware parallelism where it can ensure semantic correctness to do so. However, automatic detection and scheduling of larger-grained parallelism, such as co-begin parallelism between the FIP and ALP, is beyond the scope of the compiler. To introduce explicit parallelism into the language, we use intrinsic functions to control concurrent FIP/ALP execution. In our model of concurrent activity, the FIP controls initiation and termination of concurrent activity. To initiate a concurrent thread of activity on the ALP, the programmer uses an intrinsic function ALP_THREAD(function_name, parameter1, parameter2, ...). The compiler recognizes ALP_THREAD as a pre-defined function and initiates autonomous operation on the ALP. FIP statements subsequent to the ALP_THREAD directive are then executed. The programmer indicates the joining of the concurrent threads with an ALP_JOIN directive. Note that at present we allow only one autonomous thread to be active on the ALP.
INTRINSIC FUNCTIONS FOR CONFIGURABLE I/O
Any ALP cell can be configured to drive a configurable I/O pin. Thus, from the standpoint of the NAPA C programming model, we have the capability to designate any ALP variable as an input or output. In addition, this unique capability is dynamic rather than static, so that during program execution, an ALP variable may be set to accept input or to supply output. The intrinsic functions ALP_INPUT(variable-name-list), ALP_OUTPUT(variable-name-list), and ALP_IO(variable-name-list) are used to mark the ALP registers that are to drive I/O pins.
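The co-begin/join pattern can be sketched as follows. The fallback macros are purely an assumption made so the fragment compiles with a standard C compiler: they run the "thread" sequentially and make the join a no-op, and the zero-argument form of `ALP_JOIN` is also assumed, since the text does not give its signature. The names `alp_sum`, `sum_block`, and `sum_two_halves` are hypothetical.

```c
/* Assumed workstation fallbacks: run the ALP thread sequentially. */
#ifndef ALP_THREAD
#define ALP_THREAD(fn, ...) fn(__VA_ARGS__)
#define ALP_JOIN()          ((void)0)
#endif

/* Result written by the ALP thread. */
static int alp_sum;

/* Work intended for the ALP: sum the first n elements of v. */
static void sum_block(const int *v, int n)
{
    alp_sum = 0;
    for (int i = 0; i < n; i++) {
        alp_sum += v[i];
    }
}

/* The FIP starts the ALP on the first half, sums the second half
 * itself, then joins before combining the two partial sums. */
int sum_two_halves(const int *v, int n)
{
    int half = n / 2;
    int fip_sum = 0;

    ALP_THREAD(sum_block, v, half);   /* co-begin: ALP works autonomously */
    for (int i = half; i < n; i++) {  /* FIP continues with its own work  */
        fip_sum += v[i];
    }
    ALP_JOIN();                       /* join: wait for the ALP to finish */
    return alp_sum + fip_sum;
}
```

Reading `alp_sum` only after the join reflects the model in which the FIP controls both initiation and termination of the concurrent activity.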
The collection of pragmas described above allows the programmer to partition the data and computation between the FIP and ALP. Although not addressed in this paper, partitioning decisions could be made by an automatic system, resulting in pragmas being mechanically inserted into the C program.
NAPA C COMPILER
The NAPA C compiler, in conjunction with back-end tools, generates a combined FIP/ALP executable. The compiler allocates the data to the desired memory space. It generates FIP code for FIP subroutines and statements via the FIP C compiler. It synthesizes hardware structures to represent the C code of the ALP subroutines and statements, and also generates FIP code to control execution of the ALP segments. When an ALP variable is referenced in the FIP code, the NAPA C compiler generates the code to fetch the variable into the FIP (which will require both FIP and ALP code).
COMPILER ORGANIZATION
As shown in Figure 4, the NAPA C compiler consists of a number of phases. Input to the compilation system is ANSI C annotated with pragmas as described above. The first phases are embedded in the SUIF compiler infrastructure. The pragmas are converted to SUIF annotations to the syntax tree. A semantic type propagation phase resolves data transfers between FIP and ALP. Such data transfers are needed, for example, when ALP data is accessed in a FIP computational block. Next, ALP segments are extracted from the syntax tree and passed to the MARGE datapath compiler. FIP code is unparsed to C and processed by the RISC processor's C compiler (part of the NSC Tools block). MARGE synthesizes the hardware circuits for the ALP segments and generates both VHDL
The MARGE datapath compiler is described in [4], and details of the synthesis to pre-placed, pre-routed macros in [3]. The NAPA1000 has been designed to support several different programming models. The "instruction enhancement mode" allows ALP circuits to augment the RISC instruction set with custom instructions. The ALP function pragma gives the programmer direct access to this model. NAPA C configurable I/O intrinsics allow the programmer to use the sensor/actuator programming model, in which data sent or received over configurable I/O can be processed in the ALP. In the next section, we describe how the NAPA C compiler supports a pipeline processing model.
PIPELINE LOOP ANALYSIS
Configurable logic is a natural candidate for pipeline synthesis. Hardware pipelines, with a long history of use, keep various stages of a unit busy by starting a new operation while the previous operation is still in progress. Software pipelining adapted this idea to a code sequence such as a loop: in some cases a new loop iteration can be started while a previous iteration is in progress. The NAPA C compiler's pipelining strategy for loops extends standard scheduling methods to pipeline loops. The algorithm implements the method described by Lam [7].
The pipeline scheduler uses a dependence graph in which the nodes correspond to ALP function units. The edges of the graph are precedence constraints. A pair $(n_i, n_j)$ is a normal edge if, in any iteration of the loop, node $n_i$ must be executed before node $n_j$. If $(n_i, n_j)$ is a special edge, then, in any iteration, node $n_j$ must not be executed before node $n_i$. A special edge permits concurrent execution of two nodes. This lets us model register use in which, during a single cycle, a value can be read from a register early in the cycle and written to the register late in the cycle.
The scheduler produces a schedule for a regular pipeline, in which every iteration is executed according to the same schedule, with successive iterations initiated at a fixed initiation interval $s$. A schedule gives a list of the nodes to be executed at each time step. The steady state of the pipeline is given by the last $s$ stages of the schedule. Throughput is one iteration every $s$ cycles.
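As a worked consequence of this definition, and assuming every stage takes one cycle and the pipeline never stalls, the total time for $k$ iterations of a schedule of length $\ell$ stages follows directly from the fixed initiation interval. The symbols $k$, $\ell$, and $T$ are introduced here for illustration and do not appear in the original text.

```latex
\[
  T(k) \;=\; \ell + (k-1)\,s ,
  \qquad
  \lim_{k \to \infty} \frac{k}{T(k)} \;=\; \frac{1}{s} .
\]
```

The first iteration completes after $\ell$ cycles, and each later iteration completes $s$ cycles after its predecessor, which gives the stated steady-state throughput of one iteration every $s$ cycles.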
The pipeline scheduler tries to find a schedule with successively increasing initiation intervals $s \le m$, where $m$ is the length of a non-pipelined schedule, which is the length of the longest path in the dependence graph. Rather than beginning at $s = 1$, we use a technique suggested by Lam to find a lower bound for the initiation interval, based on multiple uses of resources.
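A resource-based lower bound of this kind can be sketched as below. The text says only that the bound is "based on multiple uses of resources", so the specific formula used here (the usual modulo-scheduling resource bound, the maximum over resources of uses divided by available units, rounded up) is an assumption, as are the function and parameter names.

```c
/* Lower bound on the initiation interval s from resource usage.
 *   uses[r]  : number of times one loop iteration uses resource r
 *   units[r] : number of copies of resource r available (must be >= 1)
 * If one iteration uses a resource u times and only c copies exist,
 * a new iteration cannot start more often than every ceil(u / c) cycles. */
int min_initiation_interval(const int *uses, const int *units, int num_resources)
{
    int bound = 1;  /* s = 1 is the smallest possible interval */
    for (int r = 0; r < num_resources; r++) {
        int need = (uses[r] + units[r] - 1) / units[r];  /* ceiling division */
        if (need > bound) {
            bound = need;
        }
    }
    return bound;
}
```

Starting the search at this bound skips initiation intervals that cannot succeed, for example when a single memory port is accessed twice per iteration.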
The scheduling algorithm works as follows. For each node $n$ in the dependence graph, the precedence-constrained range for $n$ is an interval $[t_1, t_2]$ in which $n$ can be scheduled. For a graph containing $N$ nodes, an upper bound on the length of the pipelined schedule with initiation interval $s$ is $N + s$. The scheduler initializes the precedence-constrained range for each node to $[0, N + s]$. The range is updated at each scheduling step, as described below, to preserve the partial order given by the dependence graph.
The modulo-$s$ reservation table consists of $s$ resource vectors $used_i$ that are initially zero. This models resource use by time step modulo $s$, an innovation that Lam introduced into software pipelining, extending the idea of a reservation table used by traditional hardware pipeline schedulers to track resources shared by pipeline stages.
The pipeline scheduler selects nodes in an order that ensures that all their predecessors in the partial order given by the dependence graph have already been scheduled. To schedule a node $n$, the scheduler first limits the range to the first $s$ stages of $n$'s precedence-constrained range. This suffices because, with initiation interval $s$, at each stage $t$ all the nodes scheduled at stages congruent to $t \bmod s$ are executed. If a node cannot be scheduled within $s$ stages, it cannot be scheduled. The scheduler places $n$ at the first such stage $t$ in which the resources it requires are available: $used_{t \bmod s} \wedge needs_n = 0$, where $\wedge$ is the logical AND operator.
Having chosen stage $t$ for node $n$, the scheduler updates the vector $used_{t \bmod s}$ in the modulo-$s$ reservation table. For each node $n'$ that is a successor of $n$ in the dependence graph, it updates the lower bound in the precedence-constrained range of $n'$ from $[t_1, t_2]$ to $t_1 = \max(t_1, t + d)$, where $d$ is the length of the longest path from $n$ to $n'$. Lam's algorithm and our implementation of it recognize a class of inter-iteration dependence termed doacross, in which a value computed in one iteration is used in a later iteration. If the newly scheduled node $n$ is the target of a doacross edge in the dependence graph, then the scheduler also updates the precedence-constrained range of the node at the source of that edge.
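The placement and range-update steps can be sketched as follows. This is a simplified illustration rather than the compiler's implementation: it assumes nodes are numbered in topological order, tracks only the lower end of each precedence-constrained range, propagates bounds along direct edges (which yields the longest-path bound when nodes are visited in topological order), and omits doacross edges entirely. The `Graph` type, its field names, and the size limits are hypothetical.

```c
#define MAX_NODES 16
#define MAX_S     16

typedef struct {
    int n;                            /* number of nodes, in topological order */
    unsigned needs[MAX_NODES];        /* resource bit vector required by each node */
    int delay[MAX_NODES][MAX_NODES];  /* -1: no edge; 1: normal edge; 0: special edge */
} Graph;

/* Initialize a graph with n nodes, no resource needs, and no edges. */
void graph_init(Graph *g, int n)
{
    g->n = n;
    for (int i = 0; i < MAX_NODES; i++) {
        g->needs[i] = 0;
        for (int j = 0; j < MAX_NODES; j++) {
            g->delay[i][j] = -1;
        }
    }
}

/* Try to schedule with initiation interval s. On success, returns 1 and
 * fills stage[i] with the stage chosen for node i; otherwise returns 0. */
int modulo_schedule(const Graph *g, int s, int stage[])
{
    unsigned used[MAX_S] = {0};   /* modulo-s reservation table */
    int lower[MAX_NODES] = {0};   /* lower end of each node's range */

    if (s < 1 || s > MAX_S || g->n < 0 || g->n > MAX_NODES) {
        return 0;
    }
    for (int n = 0; n < g->n; n++) {
        int placed = 0;
        /* Only the first s stages of the range need to be examined. */
        for (int t = lower[n]; t < lower[n] + s; t++) {
            if ((used[t % s] & g->needs[n]) == 0) {
                used[t % s] |= g->needs[n];
                stage[n] = t;
                placed = 1;
                break;
            }
        }
        if (!placed) {
            return 0;
        }
        /* Raise the lower bound of every successor of n. */
        for (int m = n + 1; m < g->n; m++) {
            int d = g->delay[n][m];
            if (d >= 0 && stage[n] + d > lower[m]) {
                lower[m] = stage[n] + d;
            }
        }
    }
    return 1;
}

/* Try successively larger initiation intervals; return the first that
 * yields a schedule, or -1 if none in [s_min, s_max] does. */
int find_schedule(const Graph *g, int s_min, int s_max, int stage[])
{
    for (int s = s_min; s <= s_max; s++) {
        if (modulo_schedule(g, s, stage)) {
            return s;
        }
    }
    return -1;
}
```

A chain of four nodes that each use a distinct resource schedules at stages 0 through 3 with $s = 1$, whereas two independent nodes competing for the same resource force $s = 2$.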
When the hybrid system executes the program, it performs the pipelined loop as follows:
1. The FIP processor sends the ALP a signal to execute an initialization circuit to set up data and loop-control registers.
2. Then the FIP sends the ALP a signal to perform all iterations of the loop. On the ALP, the loop iterations are pipelined, with a new iteration started every $s$ cycles, where $s$ is the initiation interval determined for the schedule.

The partial sum is accumulated in Reg3. After the inner loop is executed $N$ times, a final circuit, not shown, is executed once to store the result to $a_{i,j}$.
For the inner loop, the NAPA C compiler generates an ALP circuit with the nodes scheduled as shown in Figure 6. A memory access takes two stages, one to load the memory address register (MAR) with the memory address, and one to load or store the memory data register (MDR). In the first stage, the addresses for $b_{i,k}$ and $c_{k,j}$ are loaded into their respective MARs. Nodes 7 and 8 are also executed, to increment the addresses of $b_{i,k}$ and $c_{k,j}$. Notice the concurrent reading and writing of registers. For example, node 1 reads MAR0 at the beginning of the stage, to get the memory address, and node 7 writes it at the end of the stage, after incrementing.
In the second stage the data are loaded from memory modules 0 and 1, respectively. The multiplication occurs in the third stage. The fourth stage has an accumulation operator that adds its input to the partial sum it maintains.
The throughput analysis summarizes the results of pipeline scheduling. Pipelining found a schedule with an initiation interval of one. The throughput of the pipelined version is one iteration every cycle, an improvement factor of four over the non-pipelined version.
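The factor-of-four figure can be checked with a small cycle-level simulation of the four-stage inner loop. This sketch makes several simplifying assumptions that are not in the text: both operand vectors are indexed by one shared address register instead of two MARs with separate increments, each stage takes exactly one cycle, and valid bits (hypothetical) track which registers hold live data. The names `PipeResult`, `pipelined_dot`, and `sequential_cycles` are illustrative.

```c
typedef struct {
    int sum;     /* accumulated dot product (the role of Reg3) */
    int cycles;  /* number of cycles the pipeline ran */
} PipeResult;

/* Simulate address -> load -> multiply -> accumulate with s = 1. */
PipeResult pipelined_dot(const int *b, const int *c, int n)
{
    int mar = 0, mar_valid = 0;               /* stage 1 result: address register */
    int mdr_b = 0, mdr_c = 0, mdr_valid = 0;  /* stage 2 result: data registers   */
    int prod = 0, prod_valid = 0;             /* stage 3 result: product          */
    PipeResult r = {0, 0};

    if (n <= 0) {
        return r;
    }
    for (int t = 0; t < n + 3; t++) {
        /* Read early: every next value is computed from current registers. */
        int next_sum        = r.sum + (prod_valid ? prod : 0);
        int next_prod       = mdr_b * mdr_c;
        int next_prod_valid = mdr_valid;
        int next_mdr_b      = mar_valid ? b[mar] : 0;
        int next_mdr_c      = mar_valid ? c[mar] : 0;
        int next_mdr_valid  = mar_valid;
        int next_mar        = t;          /* address for iteration t */
        int next_mar_valid  = (t < n);    /* no new iterations after the n-th */

        /* Write late: all registers are updated together at the end of the cycle. */
        r.sum = next_sum;
        prod = next_prod;   prod_valid = next_prod_valid;
        mdr_b = next_mdr_b; mdr_c = next_mdr_c; mdr_valid = next_mdr_valid;
        mar = next_mar;     mar_valid = next_mar_valid;
        r.cycles++;
    }
    return r;
}

/* Cycle count when each iteration runs all four stages before the next starts. */
int sequential_cycles(int n)
{
    return 4 * n;
}
```

For `n` iterations the pipelined version takes `n + 3` cycles against `4 * n` for the non-pipelined one, so the ratio approaches four as `n` grows.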
PIPELINED WALSH-HADAMARD TRANSFORM
The Walsh-Hadamard transform, widely used in signal processing applications, presents an example in which the pipelined schedule is different from a non-pipelined schedule. Figure 7 shows the equation calculated in
To eliminate control flow in the inner loop, we rewrite the code to calculate $Z'_j = x \cdot A1 + y \cdot A2$, where $A1$ and $A2$ are the two possible values shown in Figure 7, and $x$ and $y$ are truth values that are 1 to select the term and 0 to ignore it. Figure 8 shows the three-address code for this linear form of the inner loop of the code.
The code reads $Z_k$ from Memory 0 and writes $Z_{k+1}$ in Memory 1. Successive iterations of the outer loop alternate in copying from one memory to the other. Notice that this code is an excellent candidate for dynamic partial reconfiguration, to reverse the input and output memory banks.
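The rewrite into a linear form can be illustrated generically. Because Figure 7 is not reproduced here, the sketch below does not attempt the actual Walsh-Hadamard expressions for A1 and A2; it shows only the selection technique, and the function names are hypothetical.

```c
/* Branching form: control flow chooses between the two candidate values. */
int select_branch(int take_first, int a1, int a2)
{
    if (take_first) {
        return a1;
    }
    return a2;
}

/* Linear form: x and y are truth values (0 or 1); each multiplies its
 * term so the selected value survives and the other contributes zero. */
int select_linear(int x, int y, int a1, int a2)
{
    return x * a1 + y * a2;
}
```

With `x = 1, y = 0` the linear form yields `a1`, and with `x = 0, y = 1` it yields `a2`, so the loop body becomes straight-line datapath code that the scheduler can pipeline.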
