AbstractÐPortable image processing applications require an efficient, scalable platform with localized computing regions. This paper presents a new class of area I/O systolic architecture to exploit the physical data locality of planar data streams by processing data where it falls. A synthesis technique using dependence graphs, data partitioning, and computation mapping is developed to handle planar data streams and to systematically design arrays with area I/O. Simulation results show that the use of area I/O provides a 16 times speedup over systems with perimeter I/O. Performance comparisons for a set of signal processing algorithms show that systolic arrays that consider planar data streams in the design process are up to three times faster than traditional arrays.
ae

INTRODUCTION
I MAGE processing applications are becoming increasingly important workloads as portable computing and wireless products, requiring still image or video features, become more prevalent. These applications demand more communication bandwidth and higher instruction throughput than is available from today's general-purpose microprocessors [1] . Advances in semiconductor technology can provide resources for these processing needs, but physical limits on on-chip interconnect and power consumption make scaling of current architectures difficult. High communication costs due to poor wire technology compel architectures to localize computation and reduce the demand for low latency, global communications [2] . New architectures must exploit physical data locality by processing data where it is stored [3] and where it enters the chip. This is especially critical in portable consumer products that are constrained by size, weight, power, and cost. This paper motivates the use of planar (two-dimensional) data streams to preserve physical data locality as it moves through a processing array. A new class of systolic arrays using these planar data streams is presented to increase computational throughput and efficiency. Systolic arrays were developed over 20 years ago to exploit the potential of VLSI technology [4] , [5] . Today, they can overcome the limitation of long-wire interconnect with modular computing cells that communicate locally. They are highly effective in regular, compute-intensive applications and are well suited to meet the growing demands of portable multimedia products.
A technique to synthesize systolic arrays is developed with extended graph information and modified projection vectors. These systolic arrays are collections of processors that use the planar data streams to exploit locality in the spatial orientation. A 16 times speedup in execution time is shown for the applications examined in this paper. While there are many technical challenges ahead to realize such a system, these results encourage researchers to consider using planar data streams rather than more traditional, linear approaches.
The organization of this paper is as follows: The class of image processing computation is first presented in Section 2 to set the scope of the systolic synthesis techniques. Section 3 then presents a summary of related work in systolic design and programming. Section 4 introduces a set of methodology to synthesize systolic arrays that uses planar data streams. Section 5 offers simulation results and comparisons with an example algorithm. Section 6 describes the technical challenges to build systems using planar data streams. Section 7 then concludes with directions of future research. Fig. 1 illustrates a generic image processing chain from image sensors to a compressed data stream [6] . The representative data types are shown above the processing chain. The image data collected from a color image sensor consists of a pattern of red, green, and blue information [7] . The color pattern comes from color sensitive CMOS or CCD elements and can differ among different sensor array designs produced by various sensor manufacturers. This image consists of high spectral information instead of spatial content and requires a color filter to extract the corresponding red, green, and blue image planes. Another processing block consisting of more traditional image filtering is required to finish the image. A compression scheme, such as JPEG for still images or MPEG for video, is required to process the image data for storage or transmission.
PLANAR DATA STREAMS IN IMAGE PROCESSING
Processing requirements for the image processing chain are compute and bandwidth intensive. As shown by the data types in the first two processing stages, the amount of data storage can be large if the images are buffered before computing. Today's processors rely on this store-andprocess methodology. Blocks of data are fetched repetitively from memory for processing. This serialized processing scheme limits the frame rate of image processing chain. Multimedia enhanced processors with SIMD extensions have been designed to harness the data parallelism in the image processing chain. Some examples of these multimedia extensions are Intel's MMX, Sun Microsystems' VIS for the SPARC, and Motorola's ALTIVEC for the PowerPC architecture. This paper proposes a methodology to partition computation in the image processing chain among an array of processors. The methodology is processor agnostic, but works well with larger amounts of small processing elements. Processing is performed in parallel following systolic methodology to maintain nearneighbor data communication among processors. A systolic synthesis approach is presented to partition the data into blocks of planar data for each processor.
This design methodology is unique because it considers the planar (two-dimensional) data types found in the image processing chain, as shown in Fig. 1 . The systolic synthesis procedure introduced in this paper considers the implementation constraints such as I/O connections and the dimensionality of data stream. Fig. 2 shows systolic array evolution in terms of data stream dimensionality. Early systolic arrays are either linear or planar arrays with peripheral I/O to feed data through the boundary [4] , [8] , [9] . These configurations are popular because low I/O dimensionality simplifies programming and design complexity. They are effective for problems with linear data streams such as speech processing and standard compression algorithms.
Systolic arrays with area I/O, shown in Fig. 2d , extend planar perimeter-bound arrays with an added I/O dimension, an evolutionary process similar to the linear array ( Fig. 2a to Fig. 2b ). In this configuration, each systolic cell has an I/O port to receive data, making specialized boundary cells unnecessary. Higher dimensionality in the I/O streams enables a more natural flow for image and video data as the entire image frame enters the array for processing in the same cycle. Conventional arrays require the image data to enter through the boundary, thereby limiting performances because they do not capture the inherent data parallelism in image processing applications.
The 2D data stream offers several key advantages for systolic arrays, previously not considered in conventional systolic arrays. With tighter coupling between processor and I/O, data is processed locally among systolic cells and the array remains busy during data loading process [10] . Area-I/O distributes staging memories among the processing elements and allows better scalability without large memory structures. The area I/O provides planar data streams for image processing applications and reduces the need to distribute data because the 2D data is spatially correlated between pins and systolic cells. 
RELATED RESEARCH
The use of regular computing arrays for solving iterative problems dates back over 40 years [11] , [12] . Research in the design and development of these arrays comes later with new VLSI technology advances when Kung and Leiserson introduced a new class of pipelined arrays, dubbed ªsystolicº arrays [4] , [5] . They presented an association of layout strategies to computations for signal processing, matrix linear algebra, and combinatorial algorithms. Systolic designs stress scalability, high degree of concurrency, and repetitive use of data to efficiently utilize the large number of transistors per integrated chip. The apparent simplicity in hardware through the use of modular cells intensifies this research area, as the system designer merely replicates the array pattern with more processing-elements for larger problem sizes.
Despite extensive research activities, systolic arrays remain difficult to design and implement. The design process requires detailed information for the traversal paths, computation sequences, and entrance/exit points of all data elements, including intermediate data streams. While the precise nature of this programming style makes systolic array implementation difficult, the detail scheduling of computation and communication events enables high performance, highly efficient operations.
There are several systematic methodologies for designing a systolic array, which includes transformation techniques from signal flow graph [8] , mathematical transformation on index set and dependence vectors [13] , and mapping procedures for multistage algorithms [14] . An earlier survey is available in [15] . More recently, systolic algorithms are reclassified as a proper subclass of Regular Iterative Algorithms (RIAs) [16] , which was introduced earlier as Uniform Recurrence Equations (UREs) [17] . The design methodology requires representations of data dependencies and functional requirements in a simple graph, often referred to as the Reduced Dependence Graph (RDG). Research in RIAs and UREs involved an RDG analysis to determine computability and to compute proper schedules [16] , [18] , [19] , [20] , [21] . These research approaches are inspired by algorithm requirements and are driven by development of automated design tools for data graph projection, mapping, and scheduling [22] , [23] .
Other systolic programming approaches involve language and compiler design to capture parallelism. Algorithms are written in a high-level language and a specialized compiler is required to map computation to the systolic array. Several examples include the Warp and iWarp optimizing-compiler [24] , [25] and the Brown Systolic Array [9] .
This paper extends the collection of research by investigating I/O interaction in algorithm development and performance. Concerns of processor interaction with planar data streams and area I/O are addressed, building upon earlier studies on the problem of loading and unloading data in perimeter-bound arrays [10] . RDG projection, mapping, and scheduling procedures are revised from current RIA methodologies [16] , [20] to handle planar data streams.
SYSTOLIC SYNTHESIS FOR PLANAR DATA STREAMS
This section presents several Reduced Dependence Graph (RDG) modifications to handle planar data streams.
Research efforts are described to improve computation latency and data locality. Constraints for planar data streams are considered during synthesis of an array of processors. While there are other methods to map computation on an array of processor [26] , [27] , the methodology presented here represents an approach that considers the I/O in a two-dimensional format.
Computation in the image processing chain, described earlier in Section 2, is first programmed in a flow-graph. Section 4.1 provides a brief discussion on the definitions and terminology for describing an RDG. This portion of the systolic methodology is well-researched [16] , [19] , [20] and leveraged in this paper. The RDG is arranged in a cubiclattice, as shown in Fig. 3a , in order to facilitate the choice of a projection vector for planar I/O. The shape of the lattice is cubic to align with synthesis constraints for arrays of image sensors or area I/O pins. Section 4.2 describes the projection vector for planar I/O, followed by descriptions of synthesis constraints. Technology issues are described later in this paper in Section VI.
Continuing on the description of the systolic synthesis methodology, Fig. 3b illustrates the method in which data is shared among processors. Data is transmitted between near neighbor processors to transfer data in a pipelined fashion. This scheme maintains the localized communication schemes in systolic computation model. Total communication distances are related to the data dependence distances in the RDG.
Execution latency of RDG depends on the arrival of data elements to the respective processors. Fig. 3c shows interprocessor communication with the dashed arrows. The RDG nodes are placed into iso-temporal hyperplanes (or ªsystolic time slicesº) in which they can execute without data hazards. The example in Fig. 3 shows three near neighbor processors (P1-P3) sharing a common data element. This data element is transmitted and the arrival time dictates the completion of the RDG. Fig. 3d shows the introduction of three key descriptors for the RDG. The three-dimensional dependency vector, orientation matrix, and movement vector are introduced to capture the arrangement of planar data as it moves through the cubic-lattice. These descriptions are described in detail in Section 4.3.
A systolic scheduler tool, described in Section 4.4, has been developed to automatically calculate the values in the dependence vector, orientation matrix, and movement vector for computation on area I/O arrays using planar data streams. Arrival times for any transmitted data are calculated to determine RDG depth and execution latency. Section 4.5 provides an example algorithm using the systolic synthesis methodology.
Definitions and Terminology
A brief discussion is provided on the definitions and terminology to describe RDG for Regular Iterative Algorithms (RIAs). While formal definitions are available elsewhere in [16] , [19] , [20] , they are provided here to facilitate further discussions. RIAs are algorithms with regular and local data dependencies. An RIA is defined by a set of computations, F, on a set of variables, V. A set of lattice points, or domain I, bounds the indices for F and V. The dependence graph of an RIA has a node for each variable and directed arcs representing dependencies between variable nodes. An RDG is a simplified graph where iteration versions for variables are labeled by indices under domain I rather than explicitly defined as different nodes. A connection matrix C and a dependence matrix D are used to represent elements of the RDG. Elements in C describe how the variable nodes are attached. Elements in D describe the dependence-distance between variable nodes along a dimension of domain I. Mathematical functions performed on a single column of C or D modifies a single edge in the RDG.
Systolic arrays extend RIA characterization with two indices: time and processor P domains. The time index is used to schedule the RIA by associating a systolic clock to computations and variables (F and V). Computation and variable nodes with the same systolic clock (those that can execute in the same cycle) form an iso-temporal hyperplane that can be derived from C and D [16] , [20] . The RIA is deemed computable if it has a valid schedule where all precedence constraints are not violated. Computability is established by ensuring that the RDG is acyclic in a bounded domain I. A specialized tree structured derived from the RDG can be created to determine computability [20] .
Once a schedule for computations and variables (F and V) is determined, the nodes are partitioned and mapped onto an array labeled with processor P indices. Because optimal schedules are extremely difficult to find, current techniques are restricted to linear partitions. All computations for a processor lie in the same parallel line that corresponds to an index in domain I. This line is determined from the normal vector u that is perpendicular to the isotemporal hyperplanes. The processor array (including the communication links) can then be obtained by projecting the RDG along the projection vector u onto a lower dimension lattice with P indices.
Projection Vector and Index Space
The projection vector u for systolic arrays with area I/O is aligned perpendicular to the area I/O or processor plane. Consider the following convolution algorithm:
where x is the image, h is a k-tap filter, and Y is the filtered image. The dependence graph after variable pipelining is illustrated in Fig. 4a . Traditionally, all of the computation (or graph nodes) would be assigned to a single processor. Fig. 4b shows a single-issue processor completing each computation node in order. As such, 12 clock cycles are required to complete the computation. In comparison, a multiprocessor system requires a partitioning between the computation nodes among the processor array. For this paper, only systolic mapping methodology is considered. As illustrated in Fig. 5a , traditional systolic mapping methodology will select a projection vector along the j axis [8] , [28] . This mapping consumes three systolic processors, P1-P3. A single processor, P1, handles all inputs and the output exits at the opposite edge with processor P3. Each processor holds a single filter coefficient resident in memory. This scheme requires four cycles to complete the example computation.
For a systolic array with area I/O in Fig. 5b , a projection vector along the k axis will be chosen instead. The projection direction follows the normal vector to the I/O plane for the output stream Y so that every processor (P1-P4) has I/O capability. This projection scheme requires three cycles to complete the example computation. For the algorithms discussed later in this paper, the projection vector is set with similar procedures to ensure the proper number of I/O ports and data flow for the image streams.
The conventional projection scheme (in Fig. 5a ) requires the output data to be pipelined through other processors. However, the area I/O projection scheme (Fig. 5b) maintains data locality of each image pixel. This strategy increases the number of processors and memory to hold filter coefficients, but, as a trade-off, computation latency can reduce. While the area I/O projection scheme may not be suitable for all systolic synthesis arrays, the methodology maximally utilizes available processors and I/O by allowing data input for each processor. Designs constraints due to physical implementation limitations are described in the following section.
Synthesis Constraints Due to Implementation
This above example shows the projection scheme for a linear image stream Y 0 -Y 3 . For a 2D image, the projection scheme in Fig. 5b can be extended to form a cubic lattice of computation nodes (in Fig. 6 ).
The number of area I/O ports restricts the range of indices in domain P. The number of pins required for an I/O port depends on bits-to-pin ratio (wide buses to bitserial links). This restriction on processor space imposes similar constraints on domain I in the computation lattice and requires computation to be partitioned among the fixed number of processors [13] . A block-checkerboard data partitioning schedule (or tiled partitioning) is used because this scheme provides the best data locality for image processing [27] , [29] . Each processor performs computation on its own block of data in parallel with other processors.
The synthesis methodology also considers physical implementation issues and relates back to the design methodology as constraints. For example, area I/O connections impose several communication patterns that require consideration during array synthesis. Fig. 6 illustrates the target domain of an RDG on a systolic array with area I/O. The chip orientation shows that data flows from the top (for optical signals) and bottom (for electrical signal with pins). Three planar data streams are shown as a basic example of an image processing application: Xin represents the input image stream, W is the filtering mask, and Yout is the output image. These data streams are placed in the processor coordinates system i; j; k and must be aligned to the chip orientation along with the projection vector u. The computation lattice, containing intermediate variable S and iterations of Xin, W, and Yout, must also be aligned with the spatial coordinates of the area I/O connections. The RDG applies for each index of Yout because of the data parallelism in image processing applications.
Area I/O restrictions on systolic array synthesis represent a bottom-up view on constraints from processor to algorithm domains. Area I/O connections limit the RDG index space and projection vector direction. However, opportunities exist to improve computation latency. The data streams can be reoriented with respect to the processor coordinates to reduce data dependence distances. The input data streams, Xin and W, can migrate in the computation lattice along the processor coordinate system to minimize communication [30] .
Data Orientation and Migration
Area I/O restricts movement when data is brought into the systolic array. In the same coordinate system shown in Fig. 6 , planar data enters along the k axis, but can move along the i and j axis once in the array because on-chip communication links offer higher levels of interconnectivity. Synthesis procedure must be modified to mimic this movement restriction when data enters the computation lattice.
As the planar data moves in a pipeline fashion through the computation lattice, dependencies are satisfied when data arrives at different processors. To reduce communication distances, dependence distance can be minimized by reorientation of the planar data. Fig. 7 shows a rotated or translated planar data in relation to the processor coordinate system. This reorientation occurs before the planar data enters the array. Once in the array, the planar data can migrate towards the processors that require the data. Unlike conventional mapping procedures where data moves in the same direction when it enters the array [31] , data migration allows movement along other axes.
Planar data orientation is achieved with an addition of an orientation matrix T for each variable node in the RDG. This matrix is similar to the transformation matrices in computer graphics algorithms to manipulate object vertices [32] . Elements in the matrix are calculated from dependence distances in matrix D and contain information to translate, rotate, and skew the planar data coordinates in domain I before mapping onto the processor space P. For example, if the RDG indicates that all data dependencies between W and Yout is +1 in the j axis, then the orientation matrix T contains a translation value of ±1 along the j axis for W. The orientation matrix T is applied only to filter values W to offer a more general synthesis procedure because filter values are typically stored in the array. Xin and Yout planar data can be restricted by external constraints (i.e., Xin could be images captured by a digital camera using optoelectronic devices and may not be reoriented without additional hardware).
Data migration occurs during execution and can be captured with an addition of a 2D vector M for each variable node in the RDG. Elements in the vector are calculated from dependence distances matrix D (and orientation T for filter values). Data migration is applied to the entire planar data stream and is restricted to near neighbor distances in each systolic cycle. For example, if the RDG indicates that dependency distances between Xin and Yout are k; 1; 0 in the i i; j; k axes, then Xin can move +1 along the j axis because dependency distances on the j are constant. Xin can migrate +1 along the i axis because dependency distances increases with the depth of the computation lattice, k. The iso-temporal hyper-planes are perpendicular to k axis so that the depth in the computation lattice represents the systolic cycles. Current synthesis procedure assumes a torus-mesh network in the array to enable the planar data to wrap-around after it moves beyond array edges. Together with data orientation during synthesis, data migration allows systolic arrays with area I/O to reduce communication distances by reducing the dependence distances during execution.
Scheduling Computation
Computation nodes in the cubic lattice shown previously in Fig. 6 and Fig. 7 requires an execution schedule to avoid data hazards. A symbolic scheduler has been built to identify the appropriate clock cycle for each computation node. As shown in Fig. 6 , the processors align with the cubic lattice such that each vertical slice of the lattice (or hyperplanes) identifies a systolic cycle in which all notes can execute without data hazards. Fig. 8 illustrates the concept of hyper-planes.
Because of resource constraints, all of the nodes in the same systolic cycle may not execute in a single processor cycle. The symbolic scheduler allocates more processor cycles for a systolic step, such that a systolic cycle is expanded or stretched in regular intervals (or processor clocks). Fig. 8 shows the scheduling policy with better cycletime resolution to further clarify the scheduling procedure. From the computation lattice, two hyper-planes at consecutive systolic cycles are considered. The data size has 16 pixels or output nodes represented by the individual squares. There are four processors in each quadrant, represented by the solid bars. Each processor is responsible for computation of the four pixel-nodes in its quadrant. Two types of events, S and Y, are labeled for different locations, representing the arrival of data so that the pixelnodes at those locations can execute. Arrows represent communication between processors.
In the example, systolic cycle n is expanded to four processor cycles because of communication constraints. There is only one channel available between processors, so a minimum of four cycles is needed to handle four communication events. In systolic cycle n+1, the top left processor has four S computation-events. However, the processor granularity limits the number of functional units to two, so that a minimum of two cycles is needed to compute S. Another processor cycle is needed to compute Y in the top left processor.
Computation on a processor must also be scheduled according to the arrival data elements over the communication network. The arrival time for any element in the planar data stream can be computed by solving equations in Fig. 9 for the systolic clock .
C represents the computation-lattice coordinates in domain I, while C H represents coordinates in processor space P, obtained by multiplying with the orientation matrix T. M is the movement vector, described previously. Subscripts associate the different variables to C, T, and M. w is the clock cycle when data element W arrives at the same processor space of S. The term w M w C H w represents the new coordinates after mapping W in the processor space and migrating for w cycle. L I represents the size of loops around the torus-mesh and the final equation is valid for integer values of . The same set of equations shown in Fig. 9 applies to X. Computation of S can occur when both X and W arrives, designated at the maximum between x and w . The procedure to find is applied to other variables dependent on S by traversing the RDG. Execution and communication delays can be added to for a more realistic result.
This symbolic scheduler is a generalized version of traditional multiprocessor scheduler [26] . It includes modifications to handle the area I/O projection schemes and the concept of systolic cycles described above. With this modified scheduler, there are some resources that are underutilized. Computation in the following systolic cycle may execute once data hazards are handled in the current systolic cycle, but the systolic cycle is extended because of resource constraints on other processors. In the example above, the top right processor is not operating in the second or third processor cycle in systolic cycle n+1. Despite resource fragmentation, this methodology allows for a straightforward scheduling procedure that does not violate the systolic partitioning in the computation lattice. The scheduling policy takes a less greedy approach to compute the algorithm. High performance in the simulation results shows that this scheduling policy is effective for the systolic array for planar data streams.
Algorithm Examples
Systolic arrays for several algorithms (matrix multiplication, Discrete Fourier Transform, 2D row/column convolution, and Discrete Wavelet Transform) were developed for analysis and comparison. The set of algorithms represents key operations in the image processing chain shown in Section 2. While there are many different ways to realize these algorithms [27] , the implemented systolic arrays provide an insight into communication patterns and synthesis requirements. The rest of this section describes the Discrete Fourier Transform (DFT) algorithm using the scheduling procedure described previously. Further details on the other implemented algorithms are available in [33] .
Discrete Fourier Transform (DFT)
The N-point DFT is defined below with x input image, W twiddle factors, and Y output.
The DFT can be viewed as evaluating the polynomial, where the product terms are cascaded.
Fig . 10 shows the communication patterns and RDG for the DFT for the image rows. The twiddle factor W is preloaded into the array as a 2D matrix. Y and W are set to move in orthogonal directions while X remains stationary in the same processor space. At every cycle in the first N cycles, product terms are generated. As Y moves, the product terms are summed. Another N cycles are needed to move the Y terms into the proper processor space. The minimum bound on computation latency is 2N cycles. The DFT algorithm has been designed such that the twiddle factors are placed strategically or preskewed to reduce dependence distances and, consequently, communication distances. The preskew information is determined by the systolic scheduler tool and is encoded into the orientation matrix. Fig. 10 illustrates the orientation matrix with an example of data arrangements.
Movements of the data elements such as twiddle factors and intermediate results are encoded in the movement vector. This methodology limits communication distance in a single cycle by transmitting data to near neighbor processors only. Computation is scheduled when data elements arrive so that execution time overlaps with communication. The RDG for DFT is shown with annotated dependency vectors. Because intermediate variables are introduced to represent data movements, the actual dependence-distance in the RDG is reduced [16] , [19] , [20] . In essence, computation of all variables is now dependent on data from near-neighbor processors and communication latency determines when the computation can execute. This scheduling scheme is described earlier in Section 4.4.
The concept of preskewing data elements has already been introduced. In Cannon's algorithm for matrix multiplication [27] , data elements that form a product term are preskewed to keep processors busy. This paper presents a similar concept for area I/O by encoding the skewing information in an orientation matrix for use in the systolic scheduler.
High dimensional algorithms tend to separable so that processing is done first on the rows and then columns. A 2D DFT includes an additional stage where W and Y movements are exchanged to perform computation on columns. The DFT can also be computed as a product of two matrices, but a different implementation is shown here to illustrate the different communication schemes in the systolic array. The data shuffling stage (after T ime N) is needed to reorganize the DFT outputs into the proper processor coordinates.
Performance Gains
This section presents a performance comparison of systolic arrays with planar data stream against conventional arrays [31] , [34] , [35] , [36] , [37] , [38] , [39] . The three different configurations are shown in Fig. 11 . The first configuration is a conventional array using peripheral I/O to bring in linear 1D data streams. Different parts of the image must be computed separately in rows and columns in order to perform the algorithm on the entire image. The second configuration is a tiling of the first array configuration to take advantage of the area I/O bandwidth. The tiled systolic arrays are smaller to maintain the same number of processors for comparisons. The third configuration is a systolic array specially designed for planar data streams. Pixel data are tiled as a 2D stream that enters into the array with area I/O pins. Fig. 12 shows cycle time comparisons for the three configurations. Execution latencies are normalized against the first configuration with linear data stream to show speedups. The first configuration represents systolic arrays that do not use the area I/O or rely upon the data parallelism in the algorithm. As expected, this configuration incurs high execution latencies compared to the other two configurations. The addition of area I/O provides an average speedup of 16. The second configuration uses the area I/O to capture some of the data parallelism. However, the tiling of multiple arrays does not exploit all the concurrency available in the application. The third configuration uses the techniques outlined in Section 4 to exploit planar data streams. For these four applications, this third configuration achieves as much as three times the performance of the second configuration, with no increase in required resources. This comparison shows the benefit of using the synthesis procedure described in this paper to capture interstream parallelism. While area I/O provides the majority of performance improvements from data parallelism, intermediate data streams that flow between the tiles of arrays in the second configuration must be considered. Using the synthesis procedures described in Section 4, a systolic array specially created for planar streams considers these intermediate data streams to minimize communication, to reuse data, and to reduce execution latency. Parallelisms between image rows are better captured with the 2D pixel tile data partitioning.
The systolic array for planar data streams performs well with a large number of processors in a single chip, favoring performance from data parallelism than clock frequency. Compared to conventional systolic arrays, the array using planar data streams trades off chip area for performance. Area I/O delivers the necessary image data to appropriate processors, thereby maintaining data throughputs to keep processors busy. This design configuration distributes processing units and memory among the array and avoids centralized functional units. Communication is localized and fast interactions in long global interconnects are minimized.
The performance comparisons described in this section are performed using published technology projections [40] . The capabilities of processors in each configuration are assumed to be similar and, therefore, clock frequency and processors-count remain constant for all configurations for a particular technology in the year 2012. Further details on the simulation approach are available in [33] .
Technical Challenges
While the systolic synthesis methodology presented in this paper provides a sizeable amount of performance improvements, there exist technical challenges to implement such a system. This section describes several key technology advances to support systolic arrays with planar I/O. New technology in Gigascale Integration (GSI) offers new resurgence in systolic array research with billion-transistor chip operating at gigahertz clock frequencies [41] . Technology advances continue to inspire systolic array research to improve node communication, I/O bandwidth, and programmability. Higher transistor density and die sizes allow for a larger system size (more processing elements) to handle larger problem sizes. Fig. 13 shows several technologies that enable multidimensional data streams. Current packaging technologies such as ball grid array, flip-chip bonding, and chip-scale packages already provide area I/O by utilizing the full area under the chip package for connections rather than the periphery [42] , [43] , [44] . Development in chip-scalepackaging shows additional promises to support a package pin count in excess of 3,000 by the year 2003 [40] . Silicon-oninsulator technologies enable 3D integration of active (silicon) layers for higher packing density and speed performance [45] . Via connections between silicon layers provide a connection scheme similar to area I/O pins. Stacks of wafers have also been built with these via connections to provide the planar I/O [46] . Another key technology includes the use of integrated CMOS sensors to provide 2D data streams [47] , [48] , [49] . Images are focused on an optical focal plane consisting of a layer of optoelectronic devices stacked above the systolic array.
Designers have already considered multidimensional systolic array as several 3D systolic arrays have been reported using 3D VLSI technology [31] , [34] , [35] . However, these arrays are not designed for planar data streams. They are created by stacking planar arrays and operate on single dimension I/O. While these systems show throughput improvement, their implementations only consider the partitioning of systolic cells into different layers and not the interaction between processor and I/O. Benefits in their 3D systolic arrays come mainly from the increased network bisection bandwidth between layers of systolic cells.
While new design opportunities with 3D VLSI are available, implementation constraints such as power consumption, packaging, and fabrication restrictions become critical. For example, clock synchronization can be difficult between sensors and processors on the same chip and across stacks of processors. Asynchronous clocking with appropriate buffering can alleviate clock skew problems but this solution requires silicon chip area that can be dedicated to processing elements. In addition, heat dissipation in 3D systolic arrays can be difficult, especially from the center of the stack of processors. The discussion of all of the implementation issues is beyond the scope of this paper. However, the performance gains presented in this paper motivates the use of synthesis methodologies described in this paper with the use of planar area I/O.
Concluding Remarks
This paper presents a new class of systolic systems employing planar data streams to enable a natural flow for image and video data. Key technology innovations such as area I/O are used to provide direct access to all systolic processors. Synthesis procedure using dependence graphs, data partitioning, and computation mapping are developed to exploit the physical locality of planar data streams by processing the data where it enters the chip. Results show that while systolic designs using area I/O can capture a large a percentage of natural data parallelism, additional synthesis procedures presented in this paper exploit locality, capture stream level parallelism, and provide up to three times improvement in execution latency. 
