In the automotive industry, there is currently great interest in utilizing computer vision algorithms to support driver-assist and autonomous-control features. OpenVX is an emerging standard for supporting workloads in which such algorithms are applied. OpenVX uses a graph-based software architecture designed to enable efficient computation on heterogeneous platforms that may include CPUs, graphics processing units (GPUs), digital signal processors (DSPs), and other accelerators. Unfortunately, in settings where real-time constraints exist, the usage of OpenVX poses certain challenges. In a recent paper, the authors presented a new implementation of OpenVX, directed at platforms comprised of CPUs and GPUs, that leverages various analytical techniques to address these challenges. In this paper, these analytical techniques are presented and discussed in detail. These techniques enable end-to-end frame processing times to be analytically bounded under OpenVX while encouraging parallelism through pipelining. Additionally, they enable bounds on frame buffering requirements to be determined.
Introduction
In the automotive industry today, vision-based sensing through cameras is being used to support features such as automatic lane keeping and adaptive cruise control. In the coming years, such features are expected to evolve and become integrated with actuation logic that supports partial or full autonomy. To enable cost-effective deployments of such features, within an acceptable size, weight, and power envelope, multiple vision-based processing streams must be consolidated onto a single hardware platform that may include components that accelerate certain computations. Such a consolidation must be done in a way that enables real-time requirements to be validated.
For computer vision algorithms, graphics processing units (GPUs) are a particularly compelling accelerator to consider, as GPUs are well suited for efficiently performing the matrix-oriented computations inherent in many computer vision applications. To ease the development of such applications on heterogeneous platforms such as those in which GPUs are employed, and to enable system-level optimization [1], a standard computer vision API called OpenVX has been created and ratified [2]. Unfortunately, several aspects underlying the design of OpenVX make validating real-time requirements problematic, despite the fact that real-time applications are an intended use case [3]. This is disconcerting, given that OpenVX undoubtedly will be adopted as a standard in many settings where such requirements exist.
* Work supported by NSF grants CNS 1115284, CNS 1218693, CPS 1239135, CNS 1409175, and CPS 1446631, AFOSR grant FA9550-14-1-0161, ARO grant W911NF-14-1-0499, and a grant from General Motors.
Problems with OpenVX. The OpenVX API provides the programmer with a set of basic operations, or primitives, commonly used in computer vision algorithms. 1 A computer vision algorithm is constructed by instantiating primitives as nodes and linking node outputs to node inputs to create a computer vision processing graph.
OpenVX has a simple execution model. From Sec. 2.8.5 of the OpenVX standard [2]: "[A constructed graph] may be scheduled multiple times but only executes sequentially with respect to itself." Moreover: "[Simultaneously executed graphs] do not have a defined behavior and may execute in parallel or in series based on the behavior of the vendor's implementation."
This model simplifies the API and implementation of OpenVX and allows it to perform well on platforms with a wide range of capabilities, ranging from simple ASICs to complex multicore+GPU platforms comprised of multiple CPUs and one or more GPUs. However, this model has three significant implications for real-time scheduling. First, the specification has no notion of a repeating (i.e., periodic or sporadic 2) task, and lacks any framework for real-time analysis. With respect to analysis, a key issue is the allowance of "back-edges" that can create cycles in a graph. Second, the specification does not define a threading model for graph execution. Finally, it requires a graph to execute end-to-end before it may be re-executed. This significantly hinders the ability to exploit parallelism by "pipelining" portions of a graph's structure to improve performance.
In a recent paper, we described a new OpenVX implementation devised by us that addresses all of these problems [4]. This new implementation extends an OpenVX implementation by NVIDIA called VisionWorks [5] and is directed at multicore+GPU platforms. Our extended version of VisionWorks is structured in a way that enables previously proposed analytical techniques to be adapted to bound end-to-end frame processing times and overall frame buffering requirements, both within an execution model that encourages parallelism through pipelining.
1 In OpenVX, these basic operations are called "kernels."
2 We assume familiarity with the sporadic and periodic task models.
Contributions. From an implementation point of view, our extended version of VisionWorks is rather complex: in total, we added approximately 34K lines of code to VisionWorks. As a result, the presentation in our prior paper [4] is primarily directed at implementation details and a case study, and the needed analytical results are only briefly sketched. The main contribution of the current paper is to present these results in much greater detail. Specifically, we present an overview of the prior analytical results being leveraged, and explain in detail how these results can be applied to derive response-time bounds and buffer bounds. Because the two papers are linked (this one focuses on analytical issues, the prior one [4] on implementation details), there is necessarily some overlap in presentation. 3
Organization. The remainder of this paper is organized as follows. We begin by describing more carefully how OpenVX graphs are defined (Sec. 2), the real-time-related challenges pertaining to such graphs (Sec. 3), and the prior work we leverage to address these challenges (Sec. 4). We then explain how to apply this prior work in our setting to obtain bounds on end-to-end processing times (Sec. 5) and overall buffer-space requirements (Sec. 6). Following this, we conclude (Sec. 7).
OpenVX
Computer vision algorithms are commonly expressed using dataflow graphs. An example is given in Fig. 1, which depicts a simple pedestrian detection application that could be used in an automotive setting. In this example, a video camera feeds the source of the graph with video frames at 30 Hz (or 30 FPS). The first node converts raw camera data into the common YUV color image format. The second node extracts the "Y" component of each pixel from the YUV image, producing a grayscale image. (Computer vision algorithms often operate only on grayscale images.) The third node performs pedestrian detection computations and produces a list of the locations of detected pedestrians. In this case, the node uses a common "soft cascade classifier" [6] to detect pedestrians. Finally, the last node displays an overlay of detected pedestrians over the original color image. To support this pedestrian detection application in a real-time setting, we require a task model and implementation that will allow us to exploit the parallelism inherently expressed by the graph, while still supporting real-time analysis and predictable execution.
OpenVX is a newly ratified standard API for developing computer vision applications for heterogeneous computing platforms. The API provides the programmer with a set of basic operations, or primitives, commonly used in computer vision algorithms. 1 The programmer may supplement the standard set of OpenVX primitives with their own or with those provided by third-party libraries. Each primitive has a well-defined set of inputs and outputs. The implementation of a primitive is defined by the particular implementation of the OpenVX standard being used. Thus, a given primitive may use a GPU in one OpenVX implementation and a specialized DSP (e.g., CogniVue's G2-APEX or Renesas' IMP-X4) or mere CPUs in another. OpenVX also defines a set of data objects. Types of data objects include simple data structures such as scalars, arrays, matrices, and images. There are also higher-level data objects common to computer vision algorithms; these include histograms, image pyramids, and lookup tables. 4 The programmer constructs a computer vision algorithm by instantiating primitives as nodes and data objects as parameters. The programmer binds parameters to node inputs and outputs. Since each node may use a mix of the processing elements of a heterogeneous platform, a single graph may execute across CPUs, GPUs, DSPs, etc.
3 The greatest overlap occurs within Secs. 1-3.
Node dependencies (i.e., edges) are not explicitly declared. Rather, the structure of a graph is derived from how parameters are bound to nodes. We demonstrate this with an example. Fig. 2(a) gives the relevant code fragments for creating an OpenVX graph for pedestrian detection. The data objects imageRaw and detected represent the input and output of the graph, respectively. The data objects imageIYUV and imageGray store an image in color and grayscale formats, respectively. At line 12, the code creates a color-conversion node, convertToIYUV. The function that creates this node, vxColorConvertNode(), takes imageRaw and imageIYUV as input and output parameters, respectively. Whenever the node represented by convertToIYUV is executed, the contents of imageRaw are processed by the color-conversion primitive, and the resulting image is stored in imageIYUV. Similarly, the node convertToGray converts the color image into a grayscale image. The grayscale image is processed by a user-provided node created by the function mySoftCascadeNode(), which writes a list of detected pedestrians to detected. 5 Fig. 2(b) depicts the bindings of parameters to nodes. Fig. 2(c) depicts the derived structure of this graph.
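To make the derivation of edges from parameter bindings concrete, the following Python sketch models each node as a triple of a name, its input data objects, and its output data objects, and infers an edge whenever one node's output object is another node's input. This is a toy model, not the OpenVX API: the function derive_edges and the node name detectPedestrians are hypothetical, while the data-object names follow the example above.

```python
from collections import defaultdict

def derive_edges(bindings):
    """Infer (producer, consumer) edges from parameter bindings.

    bindings: list of (node_name, input_objects, output_objects).
    An edge exists when a data object written by one node is read by another.
    """
    producers = defaultdict(list)
    for node, _inputs, outputs in bindings:
        for obj in outputs:
            producers[obj].append(node)

    edges = set()
    for node, inputs, _outputs in bindings:
        for obj in inputs:
            for producer in producers[obj]:
                edges.add((producer, node))
    return edges

# Bindings mirroring the pedestrian-detection example (node names partly assumed).
bindings = [
    ("convertToIYUV", ["imageRaw"], ["imageIYUV"]),
    ("convertToGray", ["imageIYUV"], ["imageGray"]),
    ("detectPedestrians", ["imageGray"], ["detected"]),  # hypothetical node name
]

for producer, consumer in sorted(derive_edges(bindings)):
    print(f"{producer} -> {consumer}")
```

No edge is ever declared directly; the chain of three nodes emerges solely from the shared data objects imageIYUV and imageGray.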
Our implementation of OpenVX, described in [4], is directed at multicore+GPU platforms and extends an OpenVX implementation by NVIDIA called VisionWorks. Specifically, a GPU-management framework developed previously by our group called GPUSync [7, 8, 9] is used along with an additional middleware layer. GPUSync treats GPUs as resources that may be acquired and released by tasks by invoking multiprocessor real-time locking protocols. A fairly comprehensive overview of this implementation is given in [4]; further details can be found in the second author's Ph.D. dissertation [9].
Ensuring Conformance to an Analyzable Task Model
The timing constraints of interest to us pertain to end-to-end graph processing times, i.e., the duration of time from when an input frame is consumed by a source node to when any corresponding output is generated by a sink node. In particular, we require that such processing times are provably bounded. As we explain in detail later, such bounds can be obtained by adapting prior results of Elliott et al. [8] , which are in turn based on even earlier results of Liu and Anderson [10] , with synchronization-related blocking due to the usage of GPUSync accounted for using blocking bounds from [9] . However, to apply these results, no cycles may exist in any processing graph. Also, each node of a graph should be viewed as an individual schedulable entity, rather than the entire graph, to enable parallelism due to pipelining effects. Unfortunately, the VisionWorks framework that we modified fails to satisfy any of these requirements, hence the need for our modifications.
Graph dependencies and pipelining. Recall from Sec. 2 that OpenVX does not pass data through graph edges. Rather, node inputs and outputs are passed through singular instances of data objects. Although graph pipelining is naturally supported if nodes rather than entire graphs are schedulable entities, a new hazard arises: a producer node may overwrite the contents of a data object before the old contents have been read or written by a consumer node! Such a consumer may not even be a direct successor of the producer. For instance, we can conceive of a graph where an image data object is passed through a chain of nodes, each node applying a filter to the image. The node at the head of this chain cannot execute again until after the image has been handled by the node at the tail. In short, the graph cannot be pipelined.
This pipelining issue can be resolved by replicating data objects, as illustrated in Fig. 3. However, replication alone is not a solution unless safe replication bounds can be determined, i.e., bounds ensuring that no data object is overwritten before being consumed. Later, in Sec. 6, we explain how to obtain such bounds.
Back-edges.
Computer vision algorithms that operate on video streams often feed data derived from prior frames back into the computations performed on future frames. For example, an object tracking algorithm must recall information about objects of prior frames if the algorithm is to describe the motions of those objects in the current frame. OpenVX defines a special data object called a "delay," which is used to buffer node output for use by subsequent node invocations. A delay is essentially a ring buffer used to contain other data objects (e.g., prior image frames). The oldest data object is overwritten when a new data object enters the buffer. The number of data objects stored in a ring buffer (or the "size" of the delay) is tied to how "far into the past" the vision algorithm must go. For example, consider a node that operates on frame i and requires access to copies of the last two prior frames. In this case, the size of the delay would be two.
The consumer node of data buffered by a delay may appear anywhere within a graph. It may be an ancestor or descendant of the producer node; it may even be the producer itself. A back-edge is created when the consumer node of a delay is not a descendant of the producer node in the graph derived from non-delay data objects. For example, in Fig. 4, which is taken from the case study presented in [4], the delay edges sourced from the "Harris Feature Tracker" node are back-edges; the other delay edges are not. As seen in Fig. 4, back-edges ostensibly result in cycles. This is problematic because the prior end-to-end response-time analysis we leverage applies only to acyclic graphs. In Sec. 5, we explain how to break such cycles.
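The ring-buffer behavior of a delay described above can be illustrated with a short Python sketch. It is a toy model rather than the OpenVX delay API: the class Delay and its methods are hypothetical names, and the frame labels are placeholders.

```python
from collections import deque

class Delay:
    """Toy model of a delay: a fixed-size ring buffer of prior data objects."""

    def __init__(self, size):
        # A deque with maxlen drops its oldest entry when a new one is appended.
        self.slots = deque(maxlen=size)

    def push(self, obj):
        self.slots.append(obj)

    def history(self):
        """Return the buffered objects, oldest first."""
        return list(self.slots)

# A node that needs the last two prior frames uses a delay of size two.
delay = Delay(size=2)
for frame in ["frame0", "frame1", "frame2"]:
    delay.push(frame)

print(delay.history())
```

After the third push, the oldest entry (frame0) has been overwritten, so only the two most recent frames remain available to the consumer.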
As the discussion above suggests, the analytical results we desire build heavily on prior work on graph-based task systems. Before delving into the details of how we addressed the problems noted above, we first review this prior work.
The Sporadic DAG Model
There is a growing body of work on real-time analysis methods for systems specified using graph-based formalisms and other formalisms that expose parallelism (e.g., see [11, 12, 13, 14, 15, 16, 17, 18] and the references cited therein). Our formal analysis here is obtained by considering the implicit-deadline sporadic DAG task model, which has been the subject of prior research by our group [19]. The following description of this model is largely taken from [19] with minor modifications to suit our needs here.
Task model. We consider a system comprised of a set $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$ of $n$ DAGs. Each DAG is a set $\tau_i = \{\tau_i^1, \tau_i^2, \ldots, \tau_i^{z_i}\}$ of tasks, and the $j$th job of the task $\tau_i^v$ is denoted $J_{i,j}^v$. An unfinished job $J_{i,j}^v$ is ready if it has been released and if $J_{i,j-1}^v$ (if $j \geq 2$) has completed execution. An example DAG $\tau_1$ is depicted in Fig. 5. As seen in this example, tasks (nodes) may be connected by edges. Each edge is directed from a producer task that produces data to a consumer task that consumes that data. A particular task $\tau_i^v$'s producers are those on edges for which $\tau_i^v$ is the consumer, and its consumers are those on edges for which $\tau_i^v$ is the producer. Each job must wait to begin execution until one job from each of its producers has completed, so that its necessary input data is available. For example, in Fig. 5, for any $j$, $J_{1,j}^4$ needs input data from each of $J_{1,j}^2$ and $J_{1,j}^3$, so it must wait until those jobs complete. To simplify analysis, we assume that each DAG $\tau_i$ has exactly one source task $\tau_i^1$, which has only outgoing edges, and one sink task $\tau_i^{z_i}$, which has only incoming edges. Multi-source/multi-sink DAGs may be supported with the addition of singular "virtual" sources and sinks that connect multiple sources and sinks, respectively. Each DAG has a common period parameter $T_i$ for all of its tasks; we explain how this parameter is interpreted when discussing scheduling below.
Each task $\tau_i^v$ also has a parameter $C_i^v$, which denotes the worst-case execution time (WCET) for any of its jobs. We assume that $\tau$ is scheduled on an identical multiprocessor. For now, we also assume that all tasks are independent. Later, we explain how to deal with dependencies created when tasks share GPUs.
Scheduling. The results of this paper can be applied to any system of DAGs where tasks are scheduled via any window-constrained global scheduler [20]; however, for ease of exposition, we specifically focus on the most widely studied such scheduler, the global earliest-deadline-first (G-EDF) scheduler. Under G-EDF, ready jobs are prioritized for scheduling on an earliest-deadline-first basis, any job may execute on any processor, and jobs may be preempted or may migrate among processors, except when executing within a non-preemptive section (e.g., when accessing a GPU). On large platforms, global algorithms such as this can be applied within clusters of processors, so our results can be adapted for applicability on such platforms as well.
As in [19], we assume that tasks corresponding to source nodes release jobs sporadically; that is, job releases of the task $\tau_i^1$ must occur at least $T_i$ time units apart. As noted above, a task corresponding to a non-source node releases its jobs as the data they require becomes available. As seen in the example schedule in Fig. 6, this can cause consecutive jobs of the same non-source task to be released fewer than $T_i$ time units apart. However, the deadlines corresponding to those jobs must still be defined to be at least $T_i$ time units apart, as the figure shows for the task $\tau_1^2$. In particular, note that jobs $J_{1,1}^2$ and $J_{1,2}^2$ are released only 7 time units apart, which is less than $T_1 = 8$, yet their deadlines are defined to be 8 time units apart. The technique used here for defining deadlines is called deadline postponement and dates back to early work on rate-based scheduling [21].
The sporadic DAG systems considered here are special cases of DAG-based systems that can be specified using the more general processing graph method (PGM) [22], the real-time scheduling of which has been studied in the context of both uniprocessors [23] and multiprocessors [10]. In PGM, the movement of data through a DAG is abstracted by considering the transmission of tokens from producer to consumer tasks. The rules that govern how tokens are produced and consumed are quite general, and as a result, the manner in which non-source tasks release jobs becomes more complicated. This level of generality is not needed in the application domains that are the subject of this paper.
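As a rough illustration of deadline postponement, the Python sketch below assigns each job a deadline one period after its release, but never less than one period after the previous job's deadline. The exact rule used in [19] and [21] is not restated in this paper, so this max-based formulation is an assumption, as are the function name postponed_deadlines and the sample release times (which are not taken from Fig. 6).

```python
def postponed_deadlines(releases, period):
    """Assign deadlines so that consecutive deadlines are at least `period` apart.

    Assumed rule: d_j = max(r_j + T, d_{j-1} + T), with d_1 = r_1 + T.
    """
    deadlines = []
    for release in releases:
        deadline = release + period
        if deadlines:
            # Postpone if the job was released fewer than T units after its predecessor's slot.
            deadline = max(deadline, deadlines[-1] + period)
        deadlines.append(deadline)
    return deadlines

# Hypothetical releases: the second job arrives only 7 time units after the first (T = 8).
print(postponed_deadlines([3, 10, 20], period=8))
```

With these inputs, the first two jobs are released 7 time units apart, yet their deadlines (11 and 19) are 8 time units apart, matching the behavior described for $\tau_1^2$.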
End-to-end latency bounds. Define the utilization of the task $\tau_i^v$ to be $U_i^v = C_i^v / T_i$, and the total system utilization to be $U_{sum} = \sum_{i,v} U_i^v$. Assume that the considered hardware platform has $m$ processors. Then, as long as $U_i^v \leq 1$ holds for each $i$ and $v$, and $U_{sum} \leq m$ holds, it can be shown that any task in any DAG has bounded deadline tardiness. In the context of the more general PGM model, this result was first established by Liu and Anderson [10] by leveraging prior work on tardiness bounds under G-EDF by Devi and Anderson [24]. In the context of the simpler sporadic DAG model, Elliott et al. [19] used these earlier results to establish per-task end-to-end latency bounds. Specifically, let $\tau'$ denote the set of independent implicit-deadline sporadic tasks corresponding to the sporadic DAG task system $\tau$, i.e., each task ${\tau'}_i^v$ in $\tau'$ has the same period and WCET as the corresponding task $\tau_i^v$ in $\tau$. Then, the deadline tardiness of any task ${\tau'}_i^v$ in $\tau'$ is guaranteed to be at most $D_i^v$ time units, where $D_i^v$ is defined according to an expression given in Theorem 1 in [24]. Based on this, Elliott et al. [19] established an end-to-end latency bound $L_i^v$ for each task $\tau_i^v$ in the original sporadic DAG task system $\tau$. $L_i^v$ upper bounds the difference $f_{i,j}^v - a_{i,j}^1$, where $a_{i,j}^1$ denotes the release time (or activation time) of the $j$th job of the DAG $\tau_i$'s source task $\tau_i^1$, and $f_{i,j}^v$ denotes the finish time (or completion time) of the $j$th job of the task $\tau_i^v$ in $\tau_i$. Such bounds are given by the following theorem.
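THEOREM 1 (THEOREM 1 IN [19]). If $Q$ is the set of all tasks along the worst-case path 6 from $\tau_i^1$ to $\tau_i^v$, including both $\tau_i^1$ and $\tau_i^v$, then any job $J_{i,j}^v$ completes within $L_i^v$ time units after time $a_{i,j}^1$, where $L_i^v$ is a sum of per-task terms taken over the tasks in $Q$ (see Theorem 1 in [19] for the expression).
6 That is, the path that maximizes the given sum.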
It is important to note that the existence of this bound relies crucially on the fact that all task graphs are acyclic. As mentioned earlier, this is not necessarily true of task graphs defined via the OpenVX specification.
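The utilization preconditions stated before Theorem 1 ($U_i^v \leq 1$ for every task and $U_{sum} \leq m$) are simple to check mechanically. The following Python sketch does so for a task system given as a list of DAGs, each with a period and a list of per-task WCETs; the function name and the sample parameters are hypothetical.

```python
def has_bounded_tardiness(dags, m):
    """Check the preconditions for bounded tardiness under G-EDF.

    dags: list of (period, [wcet, ...]) pairs, one pair per DAG.
    m: number of processors.
    """
    utilizations = [wcet / period for period, wcets in dags for wcet in wcets]
    per_task_ok = all(u <= 1 for u in utilizations)
    total_ok = sum(utilizations) <= m
    return per_task_ok and total_ok

# Hypothetical system: two DAGs with periods 8 and 10.
dags = [(8, [2, 4, 4, 2]), (10, [5, 5])]
print(has_bounded_tardiness(dags, m=2))  # total utilization is 2.5
print(has_bounded_tardiness(dags, m=3))
```

Here the total utilization is 2.5, so the conditions fail on two processors and hold on three.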
Dealing with blocking times due to GPU accesses. The latency bounds mentioned above entail no CPU capacity loss because the only preconditions for their existence are that $U_{sum} \leq m$ holds and $U_i^v \leq 1$ holds for each $i$ and $v$. However, when accounting for delays that jobs may experience as they wait to access GPUs, CPU capacity loss will generally occur. Under GPUSync [7, 8, 9], such delays are accounted for through suspension-oblivious analysis [25], wherein priority-inversion-related blocking times due to the usage of locking protocols are analytically modeled as CPU computation time. This causes an artificial inflation of per-task WCETs, and correspondingly inflated task utilizations. Such inflations can cause a loss of some fraction of the underlying hardware platform's available CPU capacity. However, any such loss is usually more than offset by the significant acceleration afforded by the usage of GPUs [9]. Because the effects of GPUs are dealt with by inflating WCETs, we can henceforth ignore them and assume we are working with WCETs that have already been properly inflated.
Buffer bounds. As mentioned in Sec. 3, pipelined execution can be enabled under OpenVX by replicating data objects, but this requires safe replication bounds. Such bounds can be obtained by extrapolating from prior work by Goddard and Jeffay on bounding the size of token buffers in PGM graphs [26]. However, because we are working with simpler sporadic DAGs here, it is possible to obtain tighter results by proving new bounds from first principles. Additionally, we must concern ourselves with the possibility that the same data object may be accessed by different tasks at different times (e.g., the $i$th video frame might be accessed by the $i$th invocations of several tasks without being copied between accesses).
Leveraging these results. To summarize, to leverage prior work on end-to-end latency bounds, we must find a way of eliminating the apparent cycles caused by delay edges in OpenVX graphs. To enable pipelined execution in OpenVX graphs, we must determine safe bounds for replicating data objects. These issues are considered in the following two sections.
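The effect of suspension-oblivious accounting on utilization, as described above for GPU blocking, can be seen in a minimal Python sketch. The blocking value would come from the GPUSync blocking analysis in [9]; here it is simply a hypothetical input, as are the function name and the numbers.

```python
def inflated_utilization(wcet, blocking, period):
    """Utilization when blocking time is modeled as CPU computation time."""
    return (wcet + blocking) / period

# Hypothetical task: 4 units of CPU work, up to 2 units of GPU-related blocking, period 8.
original = 4 / 8
inflated = inflated_utilization(wcet=4, blocking=2, period=8)
print(original, inflated, inflated - original)
```

The difference between the two values is the CPU capacity that the analysis gives up for this task; the inflated value is the one that must satisfy the utilization conditions.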
Dealing with Delay Edges
In order to leverage the prior results just discussed, we introduce the concept of a dependency graph. Given a set of OpenVX graphs, the $i$th dependency graph, $G_i$, is associated with the $i$th OpenVX graph. The $v$th node in $G_i$ is viewed as a sequential task $\tau_i^v$, as in the sporadic DAG task model. Dependencies among tasks in $G_i$ are as implied by the corresponding OpenVX graph. Specifically, $G_i$ has the same forward and delay edges as the $i$th OpenVX graph. A forward edge from the $v$th node to the $w$th node, $v \rightarrow w$, indicates that job $J_{i,j}^w$ cannot commence execution until job $J_{i,j}^v$ completes; a delay edge from the $v$th node to the $w$th node, $v \dashrightarrow w$, indicates that job $J_{i,j}^w$ cannot commence execution until jobs prior to $J_{i,j}^v$ have completed.
To be more precise about the back-trace history associated with the delay edge $v \dashrightarrow w$, we introduce two per-edge parameters $h$ and $k$, where $h \geq k$: $J_{i,j}^w$ may need the results of the jobs $J_{i,j-h}^v, \ldots, J_{i,j-k}^v$, but does not need the results of jobs outside of this range. Note that, in most existing computer vision algorithms, $k = 1$ for every delay edge. (Although $h$ and $k$ are per-edge parameters, we have avoided using superscripts or subscripts to indicate the intended edge, for simplicity.)
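The back-trace window defined by $h$ and $k$ can be written out as a small Python helper that lists which producer jobs a given consumer job may need. This is a sketch under assumptions: jobs are indexed from 1, $k \geq 1$ is assumed (a delay edge refers to prior jobs), and the function name is hypothetical.

```python
def required_history(j, h, k):
    """Return the indices of producer jobs that consumer job j may need.

    For a delay edge with parameters h >= k >= 1, job j of the consumer may
    need producer jobs j-h, ..., j-k. Indices below 1 do not exist and are skipped.
    """
    if not (h >= k >= 1):
        raise ValueError("expected h >= k >= 1")
    return [index for index in range(j - h, j - k + 1) if index >= 1]

print(required_history(j=5, h=2, k=1))  # the two most recent prior jobs
print(required_history(j=1, h=2, k=1))  # the first job has no history
```

For $h = 2$ and $k = 1$, the fifth consumer job may need producer jobs 3 and 4, which corresponds to the earlier example of a node that accesses the last two prior frames.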
We define a dependency graph to be well-formed if and only if it contains no cycles or delay edges. A set of well-formed dependency graphs corresponds naturally to a sporadic DAG task system, assuming (as we do here) that each graph's source node is invoked periodically (and hence sporadically) according to some given video frame rate. However, the set of dependency graphs arising from a given OpenVX-specified application may not be well-formed. Our goal in this section is to show how to transform such a set of graphs to a corresponding set where each graph is well-formed. We show this by considering the concept of a refinement. The dependency graph $G'_i$ is a refinement of the dependency graph $G_i$ if both have the same nodes and $G'_i$ is at least as restrictive as $G_i$, i.e., all dependency restrictions in $G_i$ are implied by $G'_i$ or can be guaranteed under G-EDF scheduling. For now, we ignore the issue of replicating data objects to prevent overwriting (equivalently, each data object can be assumed for now to be infinitely replicated to prevent overwriting); that issue is addressed in Sec. 6.
Rules for constructing well-formed refinements. In the rest of this section, we consider three rules that can be repeatedly applied as needed to a dependency graph $G_i$ to obtain a well-formed refinement of it. Each such rule application eliminates one or more delay edges in $G_i$. Once all delay edges have been eliminated, no cycles can exist. After all three rules have been stated and explained, we illustrate them with an example at the end of this section. (The reader may wish to consult the example as each rule is introduced.) The first rule handles delay edges that do not actually cause cycles.
Delay-Edge Strengthening Rule: If the delay edge $v \dashrightarrow w$ is not part of any cycle, then replace it by a forward edge $v \rightarrow w$.
Note that applying this rule always yields a valid refinement. To see why, observe that the original delay edge $v \dashrightarrow w$ indicates that the job $J_{i,j}^w$ cannot commence execution until after the jobs $J_{i,j-h}^v, \ldots, J_{i,j-k}^v$ have completed, while the forward edge $v \rightarrow w$ indicates that $J_{i,j}^w$ cannot commence execution until after $J_{i,j}^v$ has completed. Because tasks are sequential, the latter clearly implies the former.
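The Delay-Edge Strengthening Rule can be sketched in Python as follows. A delay edge from v to w lies on a cycle exactly when v is reachable from w through the graph's edges, so the sketch performs a reachability search for each delay edge. The function names and the four-node example graph are hypothetical and are not taken from Fig. 4.

```python
def reachable(edges, start, goal):
    """Depth-first search: is `goal` reachable from `start` along `edges`?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dst for src, dst in edges if src == node)
    return False

def strengthen(forward, delay):
    """Replace every delay edge that is on no cycle by a forward edge.

    Returns (forward_edges, remaining_delay_edges).
    """
    forward, remaining = set(forward), set()
    all_edges = forward | set(delay)
    for v, w in delay:
        if reachable(all_edges, w, v):
            remaining.add((v, w))   # part of a cycle: needs another rule
        else:
            forward.add((v, w))     # safe to strengthen
    return forward, remaining

# Hypothetical graph: A -> B -> C, with delay edges C ~> B (back-edge) and A ~> C.
forward, remaining = strengthen({("A", "B"), ("B", "C")}, {("C", "B"), ("A", "C")})
print(sorted(forward), sorted(remaining))
```

Only the delay edge from A to C is strengthened; the back-edge from C to B closes a cycle with the forward edge from B to C and is left for the remaining two rules.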
The remaining two rules can be applied to eliminate cycles. The first of these eliminates delay edges that are not actually necessary.
Delay-Edge Dropping Rule: If, under G-EDF scheduling, job $J_{i,j-k}^v$ is guaranteed (via response-time analysis) to be complete by the release time of job $J_{i,j}^w$ for all $j$, then the delay edge $v \dashrightarrow w$ can be removed.
Intuitively, this rule can be applied if $k$ is "large enough" to ensure that the back-trace history required by $J_{i,j}^w$ is sufficiently "far in the past" that the precedence constraint implied by the delay edge is satisfied by G-EDF scheduling anyway. The following theorem can be applied to determine if $k$ is "large enough."
THEOREM 2. If, for each delay edge $v \dashrightarrow w$ in a dependency graph, $k$ satisfies
$$k \cdot T_i \geq L_i^v, \tag{1}$$
then the Delay-Edge Dropping Rule can be applied to eliminate all such edges. Specifically, for each such edge, the job $J_{i,j-k}^v$ is guaranteed to be complete by time $a_{i,j}^w$, where (generalizing our earlier notation) $a_{i,j}^w$ denotes the release time of the job $J_{i,j}^w$.
Proof. We prove this theorem by contradiction. Assume that (1) holds and consider the corresponding graph where all delay edges have been eliminated. This graph is acyclic, and hence Theorem 1 can be applied. Assume that $J_{i,j-k}^v$ has not completed by $a_{i,j}^w$. Because the $j$th job release of the task $\tau_i^w$ cannot precede the $j$th job release of the source task $\tau_i^1$, $a_{i,j}^1 \leq a_{i,j}^w$. From our assumption, this implies that $J_{i,j-k}^v$ has not completed by time $a_{i,j}^1$, i.e.,
$$f_{i,j-k}^v > a_{i,j}^1. \tag{2}$$
Because the source task $\tau_i^1$ is invoked sporadically with a minimum release separation of $T_i$, we have
$$a_{i,j}^1 \geq a_{i,j-k}^1 + k \cdot T_i. \tag{3}$$
By (2) and (3),
$$f_{i,j-k}^v - a_{i,j-k}^1 > k \cdot T_i. \tag{4}$$
By Theorem 1, $f_{i,j-k}^v - a_{i,j-k}^1 \leq L_i^v$, so (4) implies $L_i^v > k \cdot T_i$, which contradicts (1).
Theorem 2 gives the system designer the option of adjusting the $k$ parameter of any delay edge to be "large enough" so that that edge can be effectively eliminated. However, in practical terms, this means that the computer vision algorithm is being altered to rely on back-trace history that is "older." This could result in a loss of accuracy in some vision algorithms. Therefore, we need a rule that provides an option for breaking cycles that does not involve such algorithmic alterations. Our final rule provides such an option.
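The "large enough" test of Theorem 2 can be sketched in Python. The sketch assumes that condition (1) takes the form $k \cdot T_i \geq L_i^v$, which is what the contradiction in the proof suggests; the latency bound itself would come from Theorem 1 and is treated here as a given number. The function names and sample values are hypothetical.

```python
import math

def can_drop(k, latency_bound, period):
    """Assumed form of condition (1): k * T >= L."""
    return k * period >= latency_bound

def min_droppable_k(latency_bound, period):
    """Smallest k >= 1 for which the delay edge can be dropped."""
    return max(1, math.ceil(latency_bound / period))

# Hypothetical values: latency bound of 70 time units, period of 33 time units.
print(can_drop(1, latency_bound=70, period=33))
print(min_droppable_k(latency_bound=70, period=33))
```

With these numbers, a delay edge with $k = 1$ cannot be dropped, whereas an algorithm that can tolerate history that is three frames old ($k = 3$) would satisfy the condition.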
Super-Node Creation Rule: Combine several nodes from the same graph that have dependencies with respect to each other due to delay edges into a single "super-node" that is executed as an ordinary task. 7 Each edge (forward or delay) from a node outside of the super-node to a node within the super-node becomes an incoming edge of the super-node. Similarly, each edge (forward or delay) from a node within the super-node to a node outside of the super-node becomes an outgoing edge of the super-node. The j th job of the super-node is executed sequentially by executing the j th jobs of all tasks within the super-node in an order allowed by forward edges. The WCET of the super-node is the sum of the WCETs of the contained tasks. (Recall that all tasks within the same graph have the same period.) The super-node's utilization must be at most one.
The application of this rule will result in a valid refinement, because any precedence constraints among tasks within a super-node implied by delay edges among them will be implicitly satisfied due to the enforced serial execution order. Such an enforced serialization order reduces parallelism, which may seem like a heavy-handed technique for eliminating delay edges. However, for the common case in computer vision algorithms where $k = 1$ for such an edge, the following theorem shows that an implicit serialization order often exists anyway.
THEOREM 3. Suppose there is a forward-edge path from the $w$th node to the $v$th node and $v \dashrightarrow w$ is a delay edge that therefore causes a cycle. Assuming $k = 1$ for this edge, no jobs of any two tasks in this cycle can execute in parallel.
Proof. Consider jobs of two tasks $\tau_i^p$ and $\tau_i^q$ in the mentioned cycle. Jobs of the same task clearly execute in sequence, so assume that $p \neq q$ holds. We consider two cases.
For any dependency graph, it is possible to repeatedly apply the above rules and eliminate all delay edges and cycles, resulting in a final graph that is well-formed, provided applications of the Super-Node Creation Rule do not create a super-node with utilization exceeding one (according to Theorem 3, if this occurs, over-utilization was likely inherent in the original graph anyway). However, whenever the Super-Node Creation Rule is applied, parallelism is sacrificed. Thus, its use should be avoided if possible. We conclude this section by illustrating these rules with an example.
Example. Consider again the graph in Fig. 4. As a first step, we apply the Delay-Edge Strengthening Rule to each delay edge that does not cause a cycle, i.e., all delay edges except the one from the node "Harris Feature Tracker" to the node "Compute Optical Flow." We can then eliminate any potential cycles by applying the Delay-Edge Dropping Rule to this last remaining delay edge, yielding the well-formed graph shown in Fig. 7.
Note, however, that applying this rule could require altering the computer vision algorithm to use a value of $k$ that satisfies (1) for the dropped delay edge. If this is not feasible, then we could alternatively apply the Super-Node Creation Rule to combine the two nodes connected via this delay edge into a single super-node, provided the utilization of this super-node is at most one, and obtain the well-formed refinement shown in Fig. 8. For either well-formed graph, Theorem 1 could be applied to determine latency bounds.
Replica and Buffer Bounds
The analysis in the prior section focused on maintaining required precedence constraints while eliminating cycles, under the view that individual graph nodes, rather than entire graphs, are schedulable entities, i.e., tasks. However, as noted in Sec. 2, graph edges are not explicitly declared in OpenVX but are inferred from how data objects are bound to nodes as parameters. Furthermore, as noted in Sec. 3, when individual nodes are viewed as schedulable entities, there is a danger that the data objects associated with a given edge may be overwritten. As noted there, this problem can be addressed by replicating such objects. However, for such an approach to be feasible, safe replication bounds are needed. In this section, we present such bounds. We assume that the transformations discussed in the prior section have already been applied, but we still require information exposed by the original untransformed graph. We consider ordinary data objects and those associated with delays in separate subsections.
Data Object Replicas
In discussing data object replication, we assume that no data object is accessed by multiple OpenVX graphs (or equivalently, the set of such graphs would have to be treated as one graph here). Therefore, we assume that we are working with a fixed graph and, where possible, avoid introducing identifiers to indicate which graph is meant.
To avoid overwriting with respect to forward edges in the transformed graph, we can replicate each data object $N$ times, indexing the replicas from $0$ to $N - 1$, and storing them in per-data-object buffers with $N$ entries each. We require the $j$th invocation (i.e., job) of any task in the graph under consideration to access the $(j \bmod N)$th replica. (Note that we are replicating every data object accessed within the graph to the same degree; different per-object replica bounds can be obtained with finer-grained analysis.)
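The indexing scheme can be sketched in a few lines of Python. The class `ReplicatedObject` and its method names are hypothetical, introduced only to illustrate the $(j \bmod N)$ rule; they do not correspond to any OpenVX API.

```python
class ReplicatedObject:
    """N replicas of one data object; job j always uses replica j mod N."""

    def __init__(self, n_replicas, make_replica):
        if n_replicas < 1:
            raise ValueError("at least one replica is required")
        # One buffer entry per replica, indexed 0 .. N-1.
        self.replicas = [make_replica() for _ in range(n_replicas)]

    def slot(self, j):
        """Index of the replica accessed by the j-th job of any task."""
        return j % len(self.replicas)

    def for_job(self, j):
        """The replica accessed by the j-th job of any task."""
        return self.replicas[self.slot(j)]


# With N = 3, jobs 1 and 4 share a replica, while jobs 1 and 2 do not.
image = ReplicatedObject(3, lambda: bytearray(4))
image.for_job(4)[0] = 0xFF          # job 4 writes into replica 1
print(image.slot(1), image.slot(4)) # 1 1
```

Because jobs $j$ and $j + N$ map to the same replica, the scheme is safe only if job $j + N$ never runs while the replica is still needed by job $j$ or an earlier job, which is the condition the replica bound must guarantee.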
With data objects replicated like this, we merely need to guarantee that when the $(j + N)$th job of any task is executing, no job prior to the $(j + 1)$st of any task can access any data object, i.e., the $j$th and all earlier jobs of every task have already completed. Specifically, if (11) holds, then $J^v_{i,j+N}$ will not execute at or before $f^p_{i,l}$ for all $p$ and for all $l \le j$.
Proof. We prove this theorem by contradiction. Suppose that (11) holds and at time $t$, where $t \le f^p_{i,l}$, $J^v_{i,j+N}$ is executing. Then, because the $z_i$th node is the sink node,
By Theorem 1,
Furthermore, $J^v_{i,j+N}$ cannot execute until at or after the $(j + N)$th invocation of the source node (task $\tau^1_i$), i.e., time $a^1_{i,j+N}$. Therefore,
By (12), (13), and (14),
Because the source node (task $\tau^1_i$) releases jobs sporadically and $l \le j$,
By (15) and (16),
which contradicts (11).
Ring Buffers for Delay Edges
As mentioned in Sec. 2, each delay edge in OpenVX is actually defined by a special data object called a "delay," which is used to buffer node output for use by subsequent node invocations. A delay is essentially a ring buffer used to contain other data objects, where the oldest data object is overwritten when a new data object is produced. Therefore, if the ring buffer is not large enough, data objects that are still in use may be prematurely overwritten. Thus, we also require safe bounds on ring buffer sizes, so that no such overwriting will occur.
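The overwriting hazard can be made concrete with a small Python sketch. The class `DelayRing` below is a hypothetical model of a delay's ring buffer, not the OpenVX delay object itself; it tags each slot with the index of the job that produced it so that a premature overwrite is detected rather than silently returning the wrong data.

```python
class DelayRing:
    """Toy ring buffer in which job j's output overwrites the oldest slot."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("ring buffer size must be positive")
        self.size = size
        self.slots = [None] * size

    def write(self, j, value):
        # Job j's output replaces whatever job j - size stored earlier.
        self.slots[j % self.size] = (j, value)

    def read(self, j):
        # Fetch the output of job j, failing if it was never written
        # or has already been overwritten by a later job.
        entry = self.slots[j % self.size]
        if entry is None or entry[0] != j:
            raise LookupError(f"output of job {j} is not in the buffer")
        return entry[1]


ring = DelayRing(size=2)
for j in range(3):
    ring.write(j, f"frame-{j}")

print(ring.read(2))  # frame-2
print(ring.read(1))  # frame-1
# ring.read(0) would raise LookupError: job 2 overwrote job 0's slot.
```

With a size of two, a consumer that still needs the output of job 0 after job 2 has written finds it gone, which is exactly the situation a safe ring buffer size must rule out.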
Although in Sec. 5 we analytically transformed each original dependency graph $G_i$ to a well-formed refinement, in the context of considering ring buffer sizes, we still need to consider $G_i$, which directly represents the original OpenVX graph, wherein the needed delay data objects are fully exposed. We consider a delay edge $v \dashrightarrow w$ in $G_i$. The following theorem provides a safe buffer size for each delay edge.
Proof. We consider an arbitrary job $J^v_{i,j}$ of $\tau^v_i$. By the replica bound established above, $J^v_{i,j}$ will not execute at or before $f^w_{i,l}$ for all $l \le j - N$. That is, when $J^v_{i,j}$ is executing, $J^w_{i,j-N}$ and all prior jobs of $\tau^w_i$ have already completed. Therefore, only $J^w_{i,j-N+1}$ or later jobs of $\tau^w_i$ may execute afterwards. Those jobs may require the result of some prior jobs of node $v$ but no earlier than job $J^v_{i,j-N+1-h}$ (recall the definition of $h$ given earlier in Sec. 5). So, when $J^v_{i,j}$ is writing data into the ring buffer, we only need to keep the results of $J^v_{i,j-N+1-h}$ and later jobs in this ring buffer. Thus, a ring buffer size of $N + h$ is sufficient.
Figure 9: Illustration of the ring buffer bound for a delay edge $v \dashrightarrow w$ that is a back edge (i.e., causes a cycle).
The following theorem provides a significantly tighter buffer size bound in a common special case.
THEOREM 6. If $v \dashrightarrow w$ causes a cycle in $G_i$ and is the only delay edge in that cycle, then a ring buffer size of $h$ is sufficient.
Proof. If $v \dashrightarrow w$ is the only delay edge in a cycle, then there is a forward-edge path from the $w$th node to the $v$th node, as shown in Fig. 9. Suppose that the most recently ready job of $\tau^v_i$ is $J^v_{i,j}$. Due to the forward-edge path, $J^v_{i,j}$ being ready implies that $J^w_{i,j}$ has already completed, which means only $J^w_{i,j+1}$ or later jobs of $\tau^w_i$ could execute next and need delay buffer data. Therefore, the earliest delay buffer data that will be needed in the future is that from $J^v_{i,j+1-h}$. (By the definition of $h$ given earlier in Sec. 5, $J^w_{i,j+1}$ may require the result of some prior jobs of node $v$ but no earlier than job $J^v_{i,j+1-h}$.) Moreover, since $J^v_{i,j}$, by definition, is the most recently ready job of $\tau^v_i$, no job of $\tau^v_i$ later than $J^v_{i,j}$ is ready, let alone executing. Thus, a buffer size of $h$ is sufficient.
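The two ring buffer bounds can be summarized in a short Python sketch. The function names and the boolean flag are hypothetical; the sketch only encodes the sizes stated in the text ($N + h$ in general, $h$ when the delay edge is the only delay edge in its cycle) and checks that each size equals the number of producer jobs whose results must be retained according to the corresponding proof.

```python
def ring_buffer_size(n_replicas, h, only_delay_edge_in_cycle=False):
    """Sufficient ring buffer size for one delay edge, per the text."""
    return h if only_delay_edge_in_cycle else n_replicas + h


def retained_jobs(j, n_replicas, h, only_delay_edge_in_cycle=False):
    """Producer job indices that must stay buffered while job j is current.

    General case: jobs j - N + 1 - h through j.
    Sole delay edge in a cycle: jobs j + 1 - h through j.
    """
    if only_delay_edge_in_cycle:
        oldest = j + 1 - h
    else:
        oldest = j - n_replicas + 1 - h
    return range(oldest, j + 1)


# Illustrative values only: N = 3 replicas, h = 1, current job j = 10.
general = retained_jobs(10, 3, 1)
special = retained_jobs(10, 3, 1, only_delay_edge_in_cycle=True)
print(list(general))  # [7, 8, 9, 10]
print(list(special))  # [10]
```

In both cases the number of retained jobs matches the stated buffer size, since the range from the oldest required job through job $j$ contains $N + h$ and $h$ indices, respectively.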
Conclusion
The need to support real-time graph-based computer vision applications in embedded domains such as in the automotive industry is of growing importance. Moreover, to reap size, weight, and power advantages, there is growing interest in using GPUs in supporting such applications. Given that OpenVX is a ratified standard, it is likely to see widespread use for this purpose in the future. The case for adopting OpenVX is further strengthened by NVIDIA's dominance in the GPU sector and their implicit backing of OpenVX through the development of VisionWorks.
When real-time correctness is a concern, the use of OpenVX creates several challenges. In a prior paper [4], we presented a new OpenVX implementation, based on a variant of VisionWorks, that addresses these challenges. That paper specifically focused on implementation details and a case study, with the analytical results that justify the implementation only briefly sketched. In fact, a complete explanation of these analytical results was deferred to a separate paper, namely this one.
These analytical results can be factored into two main contributions. First, we presented transformations that can be applied to OpenVX-derived graphs to eliminate delay edges and cycles so that prior work on end-to-end latency bounds can be applied. These transformations involve treating individual graph nodes as schedulable entities. This can create data hazards that can be avoided by replicating data objects, but safe replication bounds are needed for such an approach to be feasible. As a second contribution, we showed how to derive such bounds. Together with [4], the results of this paper provide a solid foundation for supporting OpenVX graphs on multicore+GPU platforms in a way that encourages parallelism through pipelining while allowing real-time guarantees to be validated.
