The paper introduces a novel parallelizing compilation method for the MOM. The MOM (Map-oriented Machine) is an Xputer architecture featuring multiple data sequencers and "soft ALUs". The compiler accepts C source programs, which are restructured and partitioned into structural and sequential code, providing parallelism at expression and statement level.
Introduction
Today we are facing increasingly complex tasks to be performed by computers. Many of these tasks are computation-intensive, requiring a huge amount of data throughput and high performance. Empirical studies [9] show that the major share of computation time is due to rather simple loop constructs. Since these loop constructs are additionally combined with indexed array data structures, ordinary von Neumann style computers are burdened mainly with addressing computations rather than with actual data manipulations. First efforts to reduce addressing overhead and to introduce parallelism have been made through the development of supercomputers [2], [7] and of parallelizing compilers [4], [10]. But the exploitation of inherent parallelism is restricted by the hardware structure of the target machines.
A new machine combining the advantages of both structural programming and traditional von Neumann style procedural programming has been introduced by the architectural class of Xputers, data-parallel machines with shared memory. Field-programmable logic is used to offer configurable instructions, which allow fully parallel execution, in contrast to e.g. vector computers and other parallel computers, which mostly work in a pipelined manner.
Thus several requirements arise for the development of a new Xputer compilation method.
The imperative language C has been chosen as input language. Fine-grained parallelism at statement and expression level has been achieved in order to enable the exploitation of the reconfigurable ALU (rALU). A second major issue in Xputer program compilation is the extraction of the program's data and its mapping in a regular way over the Xputer memory space. This data arrangement, together with the extracted data dependencies, determines the required data sequencing and thus contributes substantially to the efficiency and performance of program execution [1]. A third issue is the synthesis of the structural code onto the rALU. Before the parallelizing compilation method is explained, the target hardware is briefly sketched by introducing the Map-oriented Machine 3 (MOM-3).
1063-6862/95 $4.00 © 1995 IEEE
The Map-oriented Machine
The new architectural class of Xputers [1] is especially designed to reduce the von Neumann bottleneck of repetitive decoding and address interpreting. This bottleneck accounts for a significant share of the run time of algorithms in the targeted application areas (90% in image processing, 58% in DSP [1]).
Although the current prototype MoM-3 may serve as a stand-alone machine, it is currently embedded as a general-purpose co-processor in a VMEbus-based workstation. After setup, the MOM-3 runs independently of the host computer until the complete application has been processed. Setup in this case means that the host software has to load the application data into the MOM-3 data memory, load the parameter sets of the generic address generators, the rALU configuration code and the program for the MOM-3 controller into the control memory, and initiate execution.
The MOM-3 supports up to seven generic address generators (GAGs), each with its own segment of data memory and a rALU subnet on the same board, called a computational module (C-module). The rALU subnets of all C-modules are connected to allow propagation of interim results to the next board (figure 1). That way, complex operations which require more resources than a single rALU subnet can provide can be carried out on multiple modules. As long as the data required by a rALU subnet resides in the memory segment on the same C-module, data accesses can be done in parallel to data accesses on the other modules. Otherwise, data is transferred on the common MoMbus, which enforces a sequentialization of non-local data accesses.
The rALU is based on field-programmable logic (FPL). In the MOM the reconfigurable datapath architecture (rDPA) is used, an FPL architecture with better throughput and higher area efficiency than commercially available FPL [6]. The (re-)configuration of GAGs and rALU subnets is done by a special MOM-3 controller circuit (M3C). It holds all configuration data for a complete application in its memory and is able to switch between configurations by downloading them to the GAGs or rALU subnets.
The MOM-3 operates as a kind of configurable co-processor to a host computer. All I/O operations are done by the host, as well as the memory management. The MOM-3 is purely a computational device to accelerate time-critical parts of algorithms. The host has direct access to all memory on the MOM-3, and conversely the MOM-3 can access data in the host's memory, though only sequentially due to the single bus.
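The distinction between local and non-local data accesses can be illustrated with a small model. The following is a rough sketch, not a description of the actual MOM-3 timing: the names `split_accesses` and `estimated_steps` are hypothetical, every access is assumed to take one step, and bus transfers are assumed not to overlap with local accesses.

```python
from collections import Counter

def split_accesses(accesses):
    """Split (module, segment) pairs into local and non-local accesses.

    A pair (m, s) means: the rALU subnet on C-module m needs data that
    resides in memory segment s. The access is local when m == s.
    """
    local = [a for a in accesses if a[0] == a[1]]
    remote = [a for a in accesses if a[0] != a[1]]
    return local, remote

def estimated_steps(accesses):
    """Rough step count under the simplifying assumptions stated above."""
    local, remote = split_accesses(accesses)
    per_module = Counter(module for module, _ in local)
    # Local accesses on different modules proceed in parallel, so the
    # busiest module determines the number of parallel steps.
    parallel_steps = max(per_module.values(), default=0)
    # Non-local accesses share the common bus and are sequentialized.
    return parallel_steps + len(remote)

example = [(0, 0), (0, 0), (1, 1), (2, 0)]
print(split_accesses(example))
print(estimated_steps(example))
```

In this model, a data allocation that keeps operands in the segment of the C-module that consumes them reduces the sequential part of the count.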
The Compilation Method
A partitioning, restructuring, and mapping method is needed to translate a sequential C program into code which can be executed on an Xputer. This paradigm switch shall be performed without further user interaction. The method deals with fundamental problems similar to those that arise in compiling a program for parallel execution on a multiprocessor system. These problems are: (1) identify and extract potential parallelism, (2) partition the program into a sequence of execution units according to the granularity of the architecture and the hardware constraints, (3) compute an efficient allocation scheme for the data in the Xputer data map, and (4) synthesize the structural code into the rALU and the remaining procedural code as parameters into the data sequencer hardware.
For Xputer compilation, all these problems have to be solved at compile time. First, a theory is needed for the program partitioning and restructuring (parallelization). The result of this step is the determination of a partial execution sequence. Secondly, the program's data has to be mapped in a regular way onto the two-dimensionally organized Xputer data map, followed by a computation of the right address accesses (data sequencing) for each variable. Thus far all steps are target-hardware independent. Code generation for the MOM-3 results (1) in a hardware image file containing structural information for the configuration of the rALU, and (2) in a software image file containing the parameter sets for the data sequencer hardware, especially the generic address generators.
Determination of a Program Execution Sequence
The result of parsing the program is a graph representation G = (N, E) with node set N and arc set E. The control flow of the program has to be partitioned first, in order to find parallelizable subgraphs. This step is followed by a data partitioning by partial vectorization [10].
Partitioning of the Control-Flow. Given is the program graph G = (N, E). The node set N has to be partitioned, resulting in a number of subgraphs $G_k$, $1 \le k \le n$, and an arc set E*, defining a partial execution order. For the structure of the subgraphs an additional criterion has to be formulated, namely that each subgraph has to be convex. The question is now how the partitioning of a graph into convex subgraphs can be achieved. This is done by specifying an equivalence relation on the node set N and building the corresponding equivalence classes, which represent convex subgraphs. The method for control flow partitioning results in a coarse-grained block sequence with a partial execution order [8]. These blocks are still target-hardware independent.
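The convexity criterion can be made concrete with a small check. The text does not give the equivalence relation itself, so the sketch below only tests whether a candidate block is convex, using the usual definition that no path may leave the block and re-enter it later. The function names `reachable` and `is_convex` are illustrative and not taken from the compiler.

```python
from collections import defaultdict

def reachable(adj, sources):
    """Return all nodes reachable from `sources` (sources included)."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_convex(edges, block):
    """True if every path between two nodes of `block` stays inside it."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    block = set(block)
    descendants = reachable(fwd, block)  # nodes the block can reach
    ancestors = reachable(bwd, block)    # nodes that can reach the block
    # A node outside the block that is both a descendant and an ancestor
    # lies on a path that leaves the block and re-enters it.
    return (descendants & ancestors) <= block

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(is_convex(edges, {"a", "c"}))       # b lies on a path from a to c
print(is_convex(edges, {"a", "b", "c"}))
```

A block that fails this check cannot be scheduled as one unit in a partial execution order, because a node outside the block would have to run both after and before parts of it.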
Partitioning of the Data-Flow. The goal of the next compilation step is to maximally parallelize each of the determined blocks in the sequence. To this end, the subgraphs are maximally vectorized by using the Allen-Kennedy vectorization algorithm [3]. The result is a new sequence of maximally parallelized blocks with a partial execution order. Vectorization generates a maximum degree of parallelism in a statement block of a loop nest for those statements that have no dependencies, or that are neither part of a recurrence nor members of a cycle [3]. Here the special Xputer granularity is taken into consideration and the hardware resources are exploited optimally [8]. This is the key to achieving high performance during Xputer program execution.
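The core decision of this step, which statements may be vectorized and which must stay sequential, can be sketched as follows. This is a deliberately simplified, single-level illustration: the full Allen-Kennedy algorithm works recursively over loop levels with level-annotated dependences and also generates code, none of which is shown here. The function names and the statement labels S1 to S3 are hypothetical.

```python
def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm; components are returned in topological order."""
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
    index, low, on_stack = {}, {}, set()
    stack, sccs = [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for n in nodes:
        if n not in index:
            visit(n)
    sccs.reverse()  # Tarjan emits components in reverse topological order
    return sccs

def classify_statements(nodes, edges):
    """Label each dependence component as 'vector' or 'sequential'."""
    edge_set = set(edges)
    plan = []
    for comp in strongly_connected_components(nodes, edges):
        # A component is a recurrence if it contains a dependence cycle,
        # including a statement that depends on itself.
        cyclic = len(comp) > 1 or (comp[0], comp[0]) in edge_set
        plan.append(("sequential" if cyclic else "vector", sorted(comp)))
    return plan

# S2 depends on itself across iterations, e.g. a running sum.
print(classify_statements(["S1", "S2", "S3"],
                          [("S1", "S2"), ("S2", "S2"), ("S2", "S3")]))
```

Statements outside any dependence cycle are the ones for which a maximum degree of parallelism can be generated; the order of the resulting list respects the dependences between components.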
Data Mapping and Data Aligning
The next step in compilation is to decide how the program data (variables, arrays, ...) can be mapped onto the two-dimensionally organized Xputer data map $DM = \{DM_x, DM_y\}$ in a regular fashion, and how the data fields of differently mapped data variables can be aligned to a combined data field in order to use only one scan pattern.
Planarization. The data map DM is two-dimensional, whereas the defined arrays may have higher dimensions. This leads to the mapping problem, whose solution is the definition of a data allocation scheme. In addition, the target hardware parameters (e.g. seven GAGs are available for the MOM-3) have to be respected. This leads to the data alignment problem. Unrolling the dimensions $I_i$ of a variable A defined to be d-dimensional, $d > 2$ and $d \in \mathbb{N}$, means determining a function dmap from the index domain of the data object to the two-dimensional index domain of an Xputer data map DM:
$$dmap: I_i \to \{DM_x, DM_y\}, \quad \text{with } 1 \le i \le n. \qquad (1)$$
The dimensions in the array definition are numbered from right to left and are then mapped. Even numbers are mapped onto the x-coordinate ($DM_x$), odd numbers onto the y-coordinate ($DM_y$) of the Xputer's data map DM. Xputer dimension mapping is a planarization [8].
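The even/odd rule can be written down directly. In the sketch below, `dmap_axes` follows the rule as stated, under the assumption that the rightmost dimension is numbered 1 (the text does not say whether numbering starts at 0 or 1, hence the `first_number` parameter). The function `planarize` goes one step further and combines the dimensions that share an axis into one coordinate; this combination scheme is purely an assumption for illustration and is not taken from the text.

```python
def dmap_axes(d, first_number=1):
    """Axis ('x' or 'y') for each declared dimension, listed left to right."""
    axes = []
    for pos in range(d):                       # pos 0 is the leftmost dimension
        number = (d - 1 - pos) + first_number  # numbered from right to left
        axes.append("x" if number % 2 == 0 else "y")
    return axes

def planarize(index, shape, first_number=1):
    """Map a d-dimensional index to a hypothetical (x, y) data map position.

    Dimensions sharing an axis are combined like digits of a mixed-radix
    number, so distinct indices map to distinct positions.
    """
    coord = {"x": 0, "y": 0}
    axes = dmap_axes(len(shape), first_number)
    for i, extent, axis in zip(index, shape, axes):
        coord[axis] = coord[axis] * extent + i
    return coord["x"], coord["y"]

# For a declaration A[2][3][4]: dimensions numbered 3, 2, 1 from the left.
print(dmap_axes(3))
print(planarize((1, 2, 3), (2, 3, 4)))
```

For the three-dimensional example, the middle dimension alone spans the x-axis, while the outer and inner dimensions share the y-axis.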
Determination of the Data Sequencing
The way the program's data variables are accessed through their indices is needed for the generation of scan patterns, from which the parameter sets for the data sequencer are derived. This results in the determination of an access sequence for each variable according to its indices, together with its data field, which has been mapped into a two-dimensional form. The access sequences can then be used to compute the corresponding scan patterns and parameter sets. The computation of an access sequence is influenced by the mapping, the alignment, the index expressions, and the corresponding loop limits (upper, lower, step width) [8].
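How loop limits and index expressions determine an access sequence can be sketched for an already planarized variable. The code below is a minimal illustration under several assumptions: upper loop limits are inclusive, step widths are positive, and the index expressions are given as plain functions of the loop variables. The names `access_sequence` and `scan_steps` are hypothetical, and the derivation of actual GAG parameter sets is not shown.

```python
from itertools import product

def loop_values(lower, upper, step):
    """Values of a loop variable; `upper` is inclusive, `step` positive."""
    return range(lower, upper + 1, step)

def access_sequence(loops, x_expr, y_expr):
    """Ordered (x, y) data map positions touched by a loop nest.

    loops:  list of (lower, upper, step), outermost loop first.
    x_expr, y_expr: index expressions as functions of the loop variables.
    """
    ranges = [loop_values(*limits) for limits in loops]
    # product() advances the last range fastest, like the innermost loop.
    return [(x_expr(*ivs), y_expr(*ivs)) for ivs in product(*ranges)]

def scan_steps(sequence):
    """Step vectors between consecutive accesses of the sequence."""
    return [(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(sequence, sequence[1:])]

# for (i = 0; i <= 1; i++) for (j = 0; j <= 2; j++) ... A[i][j] ...
seq = access_sequence([(0, 1, 1), (0, 2, 1)],
                      x_expr=lambda i, j: j,
                      y_expr=lambda i, j: i)
print(seq)
print(scan_steps(seq))
```

The regularity of the step vectors, here a constant step in x with a line wrap at the end of each row, is what makes it possible to describe the whole sequence by a small parameter set instead of computing each address.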
rALU Synthesis
The datapath synthesis system (DPSS) allows statements from a high-level language description to be mapped onto the rDPA. The statements may contain arithmetic or logic expressions, conditions, and loops that evaluate iterative computations on a small number of input data.
The task of configuring the rDPA is carried out in the following four phases: logic optimization and technology mapping, placement and routing, I/O scheduling, and finally code generation. Partitioning of the statements onto the different rDPA chips is not necessary, since the array of rDPA chips appears as one array of DPUs with transparent chip boundaries [6].
Conclusions
The paper has sketched the MOM-3, a prototype of a new data-parallel machine, called Xputer, and proposed a parallelizing compilation method for this machine. The method combines structural and procedural programming, according to the Xputer paradigm of a data sequencer hardware and an FPGA-based reconfigurable ALU. Due to data sequencing, which avoids repetitive address computation, high performance factors can be achieved [1], [5]. The parallelizing compilation method realizes the paradigm shift from the von Neumann paradigm (imposed by the choice of C as input language) to the Xputer computing principles without further user interaction. The method compiles such that the special Xputer fine granularity is achieved together with a high exploitation of the available hardware resources. This fact guarantees high acceleration gains. The implementation of the proposed compilation method is completed. The MOM-3 is currently being built. The address generator chips, the MOM-3 controller chip and the rALU control chip have returned from fabrication and are now under test.
