Dataflow languages provide a high-level description that can expose inherent parallelism in many applications. This high-level description can be used to automatically create efficient code and schedules based on patterns in the dataflow graphs and knowledge of the target architecture. When targeting a dataflow graph to custom hardware, it is sometimes advantageous to share nodes with similar functionality to save silicon. Any state information associated with the caller of the shared node must be stored and subsequently loaded upon firing. If prediction logic can predict which caller of a shared node is next, the associated state information can be prefetched while other nodes of the graph are executing. While some applications can be entirely scheduled at compile time, many multichannel measurement and control applications require some degree of dynamic scheduling. This paper's key contribution is a lightweight call prediction unit with 100% prediction accuracy given a runtime-determined periodic calling schedule. While applications are varied, we show a 33% speedup in a filtering application possible in wireless ad hoc networks.
INTRODUCTION
Many measurement and control applications attempt to accomplish similar tasks on multiple channels. When targeting systems to custom hardware, sharing functionality between multiple channels can save silicon. Many times these functional units contain state information that must be loaded from memory before execution, and prefetching this state information can enhance system performance. The sequence of callers can be determined at compile time if the application is statically schedulable. If the sequence is dynamic in nature, a prediction mechanism is necessary to support prefetching.
Our main objective is to explore the feasibility of caller prediction as a method for predicting the next caller of a shared functional unit executing in a parallel processing system. Dataflow graphs are good at describing parallelism, and thus are a natural fit for modeling systems where the caller prediction mechanism may be beneficial. We examine related work in classifying dataflow graphs, and build on the concept of branch prediction in pipelined processors to implement our prefetch prediction unit for dynamically scheduled dataflow graphs.
RELATED WORK
Statically schedulable models of computation do not need prediction because all scheduling decisions are made at compile time; however, they can still take advantage of prefetching state data. Synchronous dataflow (SDF), computation graphs, and cyclo-static dataflow (CSDF) are all powerful models of computation for applications where the schedule is known at compile time [4]. For a valid schedule, it is possible to speed up the process by simply pre-loading actors and their respective internal state data. Figure 1 shows the prefetch nodes explicitly in the diagram; however, the prefetch nodes could be added implicitly when targeting hardware capable of taking advantage of the parallelism exposed in the dataflow graph. The idea of prefetching data is not new. Cahoon and McKinley have researched extracting dataflow dependencies from Java applications for the purpose of prefetching state data associated with nodes [3]. Wang et al., when exploring how to best schedule loops expressed as dataflow graphs, also try to schedule the prefetching of data needed for loop iterations [6].
Dynamically schedulable models
While statically schedulable models of computation are common in signal and image processing applications and make efficient scheduling easier, the range of applications is restrictive because runtime scheduling is not allowed. Dynamically schedulable models of computation, such as Boolean dataflow, dynamic dataflow, and process networks, allow runtime decisions, but in the process make static prefetch difficult, if not impossible. Figure 2 illustrates a homogeneous and quasi-static Boolean Dataflow graph. As J. T. Buck explains in his thesis, Boolean Dataflow (BDF) is a model of computation sometimes requiring dynamic scheduling [2]. The switch and select actors allow conditional dataflow statements with semantics for control flow as shown in Figure 2. Until the switch node has executed, it is impossible to know whether FIR 1 or FIR 2 will be executed, and so, without prediction logic, prefetching of node state data cannot occur until after the switch statement. [1]. The case structure and looping constructs force the language to be dynamically scheduled. Subsets of the G language can be detected at compile time as statically schedulable.
Local variables, global variables, and other features of G make it possible to implement general process networks in LabVIEW ™ . The dynamic, yet structured nature of G makes certain subsets well suited to the exploration of prefetch prediction for shared G nodes. We used G as our prototyping environment for practical reasons.
Branch Prediction
Many modern pipelined processors use branch prediction to improve performance by allowing a processor's pipelines to remain full for a greater percentage of the time. Branch prediction reduces the number of times the pipelines must be flushed, a necessary action when the processor can't predict the next instruction following a branch instruction. If a processor can't predict the next instruction, it can't start decoding and executing it. Many methods of branch prediction exist to help the processor predict the next instruction.
One method of branch prediction uses a saturating 2-bit counter that increments whenever a branch is taken and decrements whenever a branch is not taken. The most significant bit of the counter then becomes the prediction of whether a branch will be taken or not, as illustrated in Figure 3. Two-level branch prediction was pioneered by Patt and Yeh to help keep processor pipelines full for a greater percentage of the time [8]. This prediction model uses a lookup table of saturating two-bit counters that represent the likelihood that a branch will be taken given a certain history. As illustrated in Figure 4, the history register consists of data indicating if a branch was taken, or not taken, the past n times. A '1' represents a taken branch, and a '0' represents a branch not taken. The table therefore has 2^n entries. A '1' in the counter MSB predicts that a branch will be taken, while a '0' predicts that the branch will not be taken. This approach achieves a prediction success rate in the 90% range [5]. Patt and Yeh have tabulated the hardware costs to be significant for large history lengths [7].
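The two-level scheme described above can be sketched in software as follows. This is only an illustrative model, not the hardware of [8]: the class name, the default history length, and the initial counter value are assumptions made for the example.

```python
class TwoLevelPredictor:
    """An n-bit history register indexes a table of 2**n saturating
    two-bit counters. Names and defaults here are illustrative."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0  # last n outcomes, 1 = taken, 0 = not taken
        # Assumption: every counter starts at 1 ("weakly not taken").
        self.counters = [1] * (2 ** history_bits)

    def predict(self):
        # The counter MSB is the prediction: values 2 and 3 mean "taken".
        return self.counters[self.history] >= 2

    def update(self, taken):
        c = self.counters[self.history]
        # Increment on taken, decrement on not taken, saturating at 0 and 3.
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the newest outcome into the history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def accuracy(outcomes, history_bits=4):
    """Fraction of outcomes predicted correctly, predicting before each update."""
    predictor = TwoLevelPredictor(history_bits)
    hits = 0
    for taken in outcomes:
        hits += (predictor.predict() == bool(taken))
        predictor.update(taken)
    return hits / len(outcomes)
```

For a repeating taken, taken, not-taken pattern, the model mispredicts only during warm-up, because each four-outcome history then determines the next outcome uniquely.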
IMPLEMENTATION
Users of dataflow languages such as G can place nodes on a diagram and specify whether the nodes share logic. In G, shared logic is placed in SubVIs and accesses are made sequential by specifying the SubVI as non-reentrant. If sequential access can be statically scheduled by diagram analysis, arbitration between the callers is not necessary. Furthermore, prefetching of state data can be statically scheduled and no prediction unit is necessary. If, however, access to the shared node cannot be statically scheduled, runtime arbitration between the various callers must be provided as shown in Figure 5. Different methods for scheduling access between contending callers exist. G currently uses a fair round-robin scheduler to avoid starvation. Other methods such as earliest deadline first and rate-monotonic scheduling have been shown to improve meeting deadlines in real-time systems [4].
In a dynamically scheduled dataflow graph, the control and arbitration unit can attempt to predict the next caller of the shared functional unit and prefetch any state data necessary for execution. The shared functional unit and the plurality of callers can execute in parallel on a parallel execution unit such as a multi-processor computing device capable of parallel execution, configurable hardware elements such as FPGAs, and non-configurable hardware elements such as ASICs.
Two-Level Based Caller Prediction
Our initial pass at a design of our prediction mechanism borrowed heavily from the two-level branch prediction mechanism. Figure 6 shows the overall block diagram architecture. The past call history shift register keeps track of the last n calls to a shared node. The values held in the shift register are wired directly to a hash function that converts the history entry to a row lookup index into the call history table. Ideally, there would be a row in the history table for each possible value of the past history register. However, the number of rows grows exponentially (x^n) with the length of the history register. For three callers (x) and a history length of eight (n), the direct map approach requires 3^8 = 6561 rows. A hashing function is necessary to eliminate this exponential dependence.
Call History
In the above predictor model, the maximum number of callers is analogous to a numeric base, and each position in the history shift register is analogous to a digit position in that base. For example, for ten possible callers, a history register containing 9,6,2 maps to a lookup index of 962 using base 10. For 11 possible callers, a conversion is performed by evaluating 9*11^2 + 6*11^1 + 2*11^0 = 1157. A hardware modulo hash function can convert and multiply the running sum by a large prime number during each stage, but would only use the lower k bits of the direct mapped result. Hashing introduces collisions, which decrease the prediction accuracy.
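The index computation above can be illustrated with a short sketch. The function names are hypothetical, the prime 1000003 is an arbitrary choice for the example, and the staged multiply-and-mask structure is one reading of the hash described in the text rather than the exact hardware design.

```python
def direct_index(history, num_callers):
    """Read the history as digits in base num_callers (Horner's rule).
    history[0] is the most significant digit."""
    index = 0
    for caller in history:
        index = index * num_callers + caller
    return index


def direct_table_rows(num_callers, history_len):
    """Rows needed when every possible history value gets its own row."""
    return num_callers ** history_len


def hashed_index(history, num_callers, k, prime=1000003):
    """Assumed staged hash: at each stage, fold in one caller, multiply the
    running sum by a large prime, and keep only the lower k bits."""
    mask = (1 << k) - 1
    acc = 0
    for caller in history:
        acc = ((acc * num_callers + caller) * prime) & mask
    return acc
```

With these definitions, direct_index([9, 6, 2], 10) gives 962 and direct_index([9, 6, 2], 11) gives 1157, matching the examples in the text, while hashed_index always lands in a table of 2^k rows at the cost of possible collisions.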
The large size of the history table combined with the necessity of hashing and consequently decreased prediction accuracy makes the practicality of the two-level based prediction unit questionable.
Key Contribution -Period Prediction
If, however, our application settles to a periodic schedule quickly, and the schedule changes infrequently, the need to keep track of past periodic behavior is eliminated. A predictor model that determines the current periodic behavior in the current calling history is illustrated in Figure 7. The call history shift register in the period predictor is similar to the past history register of the two-level predictor. The numeric identifier of the most recent caller is shifted in on the left while the oldest caller identifier is shifted out on the right. The period predictor first splits the history register in half, and then finds the part of the second half that best correlates with the beginning of the first half. This correlation is done using simple element-by-element equality comparisons ANDed together. All equality comparisons can occur in parallel. The results of the equality comparisons are used in a priority MUX to select one element from the history register as the prediction. For example, as shown in Figure 7, for a past call history of 12312312, the third equality is true, and the period predictor selects 3 as the prediction. Similarly, for a history of 02345602, the third equality is true, and the period predictor selects 6 as the prediction.
These comparisons assume that the period of the calling sequence is contained in the history register, and that call prediction latency is equal to the length of the history register if the period or the calling sequence changes.
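The period predictor can be modeled in a few lines of software. The sketch below is an interpretation that reproduces both examples from the text, not a transcription of Figure 7: the function name, the candidate offsets (from half the history length up to one less than the full length), and the priority order (smallest offset first) are assumptions.

```python
def predict_next_caller(history):
    """history[0] is the most recent caller, history[-1] the oldest.
    Assumes an even history length. Returns the predicted next caller,
    or None if no candidate alignment matches."""
    n = len(history)
    half = n // 2
    # Each candidate offset is one "equality": the start of the history is
    # compared element by element with the tail beginning at that offset.
    # In hardware all candidates are evaluated in parallel; the loop order
    # here plays the role of the priority MUX.
    for offset in range(half, n):
        overlap = n - offset
        if all(history[i] == history[i + offset] for i in range(overlap)):
            # The caller that followed the matching tail last time is
            # predicted to come next.
            return history[offset - 1]
    return None
```

For the history 12312312 the first two candidates (offsets 4 and 5) fail and the third (offset 6) matches, selecting 3; for 02345602 the third candidate also matches, selecting 6. A history with no repetition, such as 12345678, yields no prediction.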
Application Criteria for Period Prediction
Period prediction is not intended for every dynamically scheduled application. Instead, it can be used in the application area between applications requiring a full-featured prediction unit and statically scheduled applications.
First, the appropriate application should be dynamically scheduled and have a subcomponent that contains enough logic that it is worth sharing. Timing relationships in this application should not be affected by the sequential nature of the calls. In addition, each caller of the shared node should hold a significant amount of state data that could benefit from prefetching.
The maximum length of any periodic calling sequence must be known at synthesis time to size the history register correctly. Of course, this could be implemented in RAM, but then a significant penalty is incurred on sequential access of the elements for the comparison operations. If the maximum length of any periodic calling sequence is not known, an educated guess can be made. However, if the prediction is wrong, the application does not receive the advantage of prefetching. It should be noted that if the predictions cannot be 100% accurate, jitter will result in the execution time of the node.
Jitter will occur since correct predictions speed up the execution and wrong predictions slow down the execution. Therefore the application should tolerate a varying quality of service if predictions are not 100% accurate.
A possible application can dynamically vary the sample rate depending on how much detail on input signals is desired. The fact that data acquisition with a higher sampling rate uses more power than data acquisition with a lower sampling rate becomes important for applications that use a limited power source, such as a battery. For example, many wireless monitoring applications have such power conservation needs. It is advantageous for wireless monitoring applications to sample at higher rates only during those times when it is needed, thus conserving power.
There are many wireless monitoring applications that can utilize an ASIC, such as environmental monitoring applications, e.g., earthquake detection systems and remote weather stations. In addition, medical monitoring applications may require monitoring of patient physiological conditions at all times. A patient that cannot be connected to a stationary monitor would need a wireless and battery powered monitor. The wireless system could acquire signals from sensors, such as electrocardiogram electrodes, on the patient's body. Use of the variable sampling rate can significantly extend the battery life of this wireless monitoring system.
Another related application is remote wireless monitoring of wildlife. Here the wireless monitoring device could be strapped to a wild animal for years and so would have to be very low power. As described above, the sample rate could be varied based on the activity level of the animal. The two filtering channels, described below, may monitor heart rate and breathing. The filter coefficients are changed based on the effective sampling rate, which is based on the activity level of the animal.
RESULTS
We used a simple LabVIEW ™ filtering application (Figure 8) to test our predictor models. The timer blocks provide the dynamic sample rate for the analog to digital converters (ADCs) and for the digital to analog converters (DACs) on the two independent channels. The sample rate also determines which set of coefficients will be loaded into the FIR block for filtering. Note that a more complex example could be constructed that adaptively modifies the filter coefficients based on the actual sample rate. In the analysis that follows, the ADCs and DACs (blocks A, C, D, and E) take 20 cycles to execute, fetching the coefficients and current tap values (blocks F1 and F2) takes 31 cycles, and executing the shared FIR filter block (blocks B1 and B2) takes 6 cycles. Figure 9 illustrates a schedule with three parallel threads of execution without prefetch prediction. For our analysis, we assume that loop 1 runs twice as fast as loop 2; however, this can change dynamically. The top row shows the execution time of each block, the second row shows the execution of blocks in loop 1, the third row shows when the shared FIR block executes, and the last row shows the execution of blocks in loop 2. Fetching occurs inline with the loop executions since it is not known which loop will call the FIR block next.
On the other hand, prefetch prediction tries to prefetch the data for the next caller while other blocks execute, as shown in Figure 10 . We have noted a 33% improvement in system speed when our predictions are correct. For a period prediction unit with a history register long enough to contain the entire period, our predictions will always be correct. If our predictions are incorrect, our sampling schedule would shift, and our sampling rate would vary. A varying sample rate is acceptable for many feedback control applications, but not for most signal processing applications.
Before implementing prefetch prediction, we must examine if the performance enhancements outweigh the silicon cost. We are prototyping using LabVIEW ™ FPGA to synthesize for a Xilinx Virtex II™ to provide the parallel processing for implementing our dataflow graphs. The two-level lookup prediction approach uses over 500 slices, even without the block memory for the lookup table.
As a point of reference, a symmetric 32-tap 32-bit FIR filter uses around 400 slices. The complexity and size of the two-level predictor make its current practicality questionable.
The simplicity of the period predictor approach yields better implementation results. Figure 11 shows the size of the unit for varying history lengths and varying numbers of callers. The size of the period predictor is reasonable for a history length of 32 or less, due to the special logic and routing resources in the Virtex II™ FPGA that implement the final mux in the period predictor [9]. The latency needed for a prediction is as low as one or two cycles, as only the parallel equality tests and the subsequent select are necessary.
CONCLUSION
The concepts of prefetching and branch prediction serve as the basis for our prefetch caller prediction of shared nodes on dataflow graphs. Shared nodes are common in multichannel measurement and control systems. We have examined two mechanisms for implementing caller prediction, and have an implementation preference for detecting periodic behavior in the calling sequence. This simple, lightweight call period detection circuit is the main contribution of our research. We showed a performance improvement of 33% for a simple multi-channel application that could implement a portion of a wireless data acquisition ASIC.
