Abstract: Designing applications for scalability is key to improving their performance in hybrid and cluster computing. Scheduling code to utilize parallelism is difficult, particularly when dealing with data dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) increases programmer productivity when implementing hybrid workflows that scale to multi-core and multi-GPU systems. HTGS manages dependencies between tasks, represents CPU and GPU memories independently, overlaps computations with disk I/O and memory transfers, keeps multiple GPUs occupied, and uses all available compute resources.
I. INTRODUCTION
Hybrid clusters now play a prominent role in high performance computing; they make up five of the ten fastest supercomputers as of Nov 2015 [1]. These petascale clusters consist of nodes that contain one or more CPUs with one or more co-processors (Intel Xeon Phi [2] or NVIDIA Tesla [3]). Clusters are approaching the exascale level; the next generation of hybrid architectures will contain fat cores coupled with many thin cores/accelerators on a single chip, as seen on Intel's Knights Landing [4]. Programming these exascale machines for performance will be challenging and will require new methods that minimize data movement and maximize the number of computations done on that data [5].
This paper builds on our previous work that implemented image stitching for large scale optical microscopy images [6]. In that work, we developed a sequential implementation using optimized kernels and ported the computational functions to the GPU. Porting the code directly to the GPU resulted in poor performance. This led us to develop an implementation based on a hybrid workflow. The hybrid workflow keeps all GPUs and CPUs busy while effectively overlapping data movement with computations.
This paper presents the Hybrid Task Graph Scheduler (HTGS) to aid in building hybrid workflows for high performance image processing. HTGS is a framework and runtime system, which hides data motion, maximizes processor occupancy when running on hybrid computers, and manages memory usage to stay within system limitations. The penalties for not properly managing data motion are exposed in our previous image stitching implementation [6] .
Image stitching is used to address the scale mismatch between the dimensions of the microscope's field of view and the plate under study. To image a plate, a motorized stage acquires a grid of overlapping images. The positions of these images are computed by stitching neighboring tiles together. The positions are used to construct the image mosaic. The algorithm consists of three compute stages: (S1) the fast Fourier transform (FFT) of an image, (S2) the phase correlation image alignment method (PCIAM) [7] that acts on two neighboring images' FFTs, and (S3) the cross correlation factors (CCFs) between two neighboring images focused around a maximum intensity point identified from the PCIAM. Figure 1 shows the data flow graph. In our previous work, we started with a sequential CPU implementation, which was ported to the GPU (Simple-GPU). In Simple-GPU, NVIDIA's CUDA is used to process images and data is copied to the GPU as needed. The results show a 14% speedup compared to the sequential CPU implementation. Data motion between co-processors and CPUs dominates the performance. Using the existing compute kernels from Simple-GPU and scheduling their invocations in a hybrid workflow that properly manages memory and overlaps computations with data motion improves the Simple-GPU implementation by 24x and scales with multiple GPUs.
Developing hybrid workflows is complex and time consuming. HTGS aids in implementing hybrid workflows by using task graphs and includes a runtime system to schedule the graphs on hybrid collections of compute resources (i.e., CPUs and GPUs). HTGS helps build task graphs that handle dependencies, manages memory in multiple native address spaces (CPU/GPU), scales to multi-GPU systems through execution pipelines, and overlaps data motion with computations. Every task created through HTGS exposes the computational resources and binds tasks to the physical hardware.
Workflows have been studied by a number of groups: Concurrent Collections [8] , Intel Threading Building Blocks [9] , and Spark [10] . One hybrid workflow runtime system that uses tasks is StarPU [11] . StarPU uses work stealing to overlap CPU and accelerator computation. This is achieved by representing both memories as a unified address space and implementing a kernel for each architecture. The method is convenient, but can result in inefficient data transfer patterns. In order to gain efficient performance, it is necessary to carefully structure the delivery of data between resources as demonstrated by the Simple-GPU implementation of image stitching. HTGS achieves this by representing the address spaces of the underlying architectures separately (i.e., CPU and GPU).
In the following sections, we introduce HTGS using matrix multiplication as an example. Then we discuss a prototype implementation of image stitching using HTGS and present preliminary results and conclusions.
II. HYBRID TASK GRAPH SCHEDULING
HTGS consists of four components: (1) Tasks, (2) Data, (3) Dependency Rules, and (4) (Optional) Memory Rules. These components construct a modified task graph, which embeds nodes in a pipeline.
The modified task graph is a series of vertices and edges, which stand for tasks and data flow respectively. A task represents some function applied to the data, and data flow defines the schedule of data movement between tasks. This methodology is similar to that of signal processing data flow graphs (e.g., see [12] ), except that HTGS blends the definitions of functional components with resource allocation decisions, and incorporates special graph vertices (tasks) for memory management. More specifically, every task in a task graph is bound to one or more threads and to physical resources such as GPUs. There are four task types specified by HTGS to aid with building a hybrid task graph: (1) Memory Manager, (2) Bookkeeper, (3) Execution Pipeline, and (4) CUDA Task.
The four task types are briefly described below:
1) A Memory Manager manages memory through a pool or dynamic allocation to reuse data based on memory rules. The memory rules define the state of memory and when it can be released.
2) Bookkeeper tasks maintain the global state of a computation and handle dependencies.
3) The Execution Pipeline replicates a task graph to scale on multi-GPU systems.
4) CUDA Task binds a CPU thread to a GPU context at initialization. The context becomes available to the task's function to schedule work on the GPU. This task can be adapted to run on alternate GPU architectures such as AMD GPUs using OpenCL.
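The idea that tasks are vertices bound to threads and that edges carry data between them can be illustrated with a minimal sketch. It is not the HTGS API: the class `TaskGraphSketch`, the `bind` and `run` methods, and the `DONE` sentinel are hypothetical names chosen for illustration, and plain Java blocking queues stand in for the edges.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch, not the HTGS API: a task is a function bound to one
// thread, and each edge of the graph is a blocking queue.
final class TaskGraphSketch {
    // Sentinel placed on an edge to signal that no more data will arrive.
    static final Object DONE = new Object();

    // Binds fn to one thread that consumes from `in` and produces to `out`.
    static <I, O> Thread bind(Function<I, O> fn,
                              BlockingQueue<Object> in,
                              BlockingQueue<Object> out) {
        Thread worker = new Thread(() -> {
            try {
                for (Object item = in.take(); item != DONE; item = in.take()) {
                    @SuppressWarnings("unchecked")
                    I typed = (I) item;
                    out.put(fn.apply(typed));
                }
                out.put(DONE); // propagate termination to the next task
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        return worker;
    }

    // Runs a two-task graph (square, then add one) and sums the outputs.
    static int run(int... inputs) {
        BlockingQueue<Object> toSquare = new LinkedBlockingQueue<>();
        BlockingQueue<Object> toAddOne = new LinkedBlockingQueue<>();
        BlockingQueue<Object> results = new LinkedBlockingQueue<>();
        Thread square = bind((Integer x) -> x * x, toSquare, toAddOne);
        Thread addOne = bind((Integer x) -> x + 1, toAddOne, results);
        try {
            for (int v : inputs) {
                toSquare.put(v);
            }
            toSquare.put(DONE);
            int sum = 0;
            for (Object r = results.take(); r != DONE; r = results.take()) {
                sum += (Integer) r;
            }
            square.join();
            addOne.join();
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Because both tasks run on their own threads, the second task can start on early items while the first is still producing later ones, which is the overlap a task graph is meant to expose.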
III. EXAMPLE: OUT-OF-CORE MATRIX MULTIPLICATION
To demonstrate the functionality of HTGS, we present a task graph for out-of-core matrix multiplication. The task graph has not been implemented; it is used only to demonstrate how task graphs are built using HTGS. Matrix multiplication is computed by multiplying row entries from matrix A by column entries from matrix B, then adding their products into matrix C ($A \otimes B = C$). We split matrix C into square sub-matrices such that to compute sub-matrix $C_{i,j}$, we need the horizontal slice $A_i$ and the vertical slice $B_j$, as shown in Figure 2. In out-of-core matrix multiplication, matrices A and B may not fit in main memory; therefore sub-matrices must be loaded from disk. These pieces are defined by horizontal and vertical slices.

Figure 3 shows the corresponding task graph. One possible hybrid workflow consists of several tasks to compute out-of-core matrix multiplication and contains one dependency. The data in this task graph consist of horizontal and vertical slices. $MM_A$ and $MM_B$ are memory managers for matrices A and B that allocate slices. $Load_A$ and $Load_B$ load slices for $A_i$ and $B_j$ from disk. Each slice is sent to the first bookkeeper ($BK_1$). $BK_1$ manages the state of the computation by identifying which slices have been loaded and which can be scheduled to compute $A_i \otimes B_j = C_{i,j}$. Next, MatMul computes the matrix multiplication between each slice $A_i$ and $B_j$ for $C_{i,j}$, which is written to disk. Writing to disk can be extracted as its own task, but would require an additional memory manager. This step is removed to simplify the HTGS example. The final task in the graph is the second bookkeeper ($BK_2$). $BK_2$ is responsible for forwarding memory back to its appropriate memory manager.
Fig. 3. Out-of-core matrix multiplication task graph (tasks: $MM_A$, $MM_B$, $Load_A$, $Load_B$, $BK_1$, MatMul computing $A_i \times B_j = C_{i,j}$, and $BK_2$).
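The dependency handled by the first bookkeeper can be sketched as follows. This is an illustrative sketch only, since the paper states that the task graph was not implemented: the class `SliceBookkeeper` and its methods are hypothetical, each slice is assumed to be reported exactly once, and slices are assumed to stay resident (memory release is ignored).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a bookkeeper dependency rule: a product C(i,j) is
// ready once both slice A_i and slice B_j have been loaded.
final class SliceBookkeeper {
    private final boolean[] loadedA;
    private final boolean[] loadedB;

    SliceBookkeeper(int slicesA, int slicesB) {
        this.loadedA = new boolean[slicesA];
        this.loadedB = new boolean[slicesB];
    }

    // Records that A_i is loaded; returns the {i, j} pairs that became ready.
    List<int[]> onLoadA(int i) {
        loadedA[i] = true;
        List<int[]> ready = new ArrayList<>();
        for (int j = 0; j < loadedB.length; j++) {
            if (loadedB[j]) {
                ready.add(new int[] {i, j});
            }
        }
        return ready;
    }

    // Records that B_j is loaded; returns the {i, j} pairs that became ready.
    List<int[]> onLoadB(int j) {
        loadedB[j] = true;
        List<int[]> ready = new ArrayList<>();
        for (int i = 0; i < loadedA.length; i++) {
            if (loadedA[i]) {
                ready.add(new int[] {i, j});
            }
        }
        return ready;
    }
}
```

Each pair is emitted by whichever of its two slices arrives second, so every product is scheduled exactly once regardless of load order.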
The memory rules for this task graph depend on the matrix sizes targeted by the computation and the amount of memory available. Each rule impacts the dependency rules defined in $BK_1$. The amount of memory available to the computation is defined by the size of the memory pools for $MM_A$ and $MM_B$, which is determined based on the size of the slices. Increasing the memory pool size increases the amount of data flowing through the task graph. There are three ways of defining the memory rules for out-of-core matrix multiplication: (1) release-always, (2) release-when-ready, and (3) do-not-release. These memory rules can be defined by a simple reference counter, which is incremented when the memory returns to the memory manager.
1) Release-always will release memory as soon as memory enters the memory manager.
2) Release-when-ready releases memory allocated to A (or B) when all computations for that memory are completed, such that for slice $A_i$, $B_0$ to $B_{n-1}$ must be loaded and computed with $A_i$. The reference counts for A and B are n and 1, respectively.
3) Do-not-release will never release memory for either A or B, which is useful for in-core matrix multiplication.
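The three rules can be expressed with a single counter, as in the following sketch. The interface `MemoryRule`, the class `ReleaseCountRule`, and the factory `MemoryRules` are hypothetical names and not the HTGS API; the sketch assumes the counter is incremented on each return to the memory manager and that memory is released once the counter reaches a threshold.

```java
// Hypothetical sketch, not the HTGS API.
interface MemoryRule {
    // Called each time the memory returns to the memory manager.
    void onReturn();

    // True when the memory may be handed back to the pool.
    boolean canRelease();
}

// Releases memory once it has returned `releaseCount` times.
final class ReleaseCountRule implements MemoryRule {
    private final int releaseCount;
    private int returns = 0;

    ReleaseCountRule(int releaseCount) {
        this.releaseCount = releaseCount;
    }

    @Override
    public void onReturn() {
        returns++;
    }

    @Override
    public boolean canRelease() {
        return returns >= releaseCount;
    }
}

final class MemoryRules {
    // Release-always: released on the first return.
    static MemoryRule releaseAlways() {
        return new ReleaseCountRule(1);
    }

    // Release-when-ready: released after `uses` returns (n for A, 1 for B).
    static MemoryRule releaseWhenReady(int uses) {
        return new ReleaseCountRule(uses);
    }

    // Do-not-release: never released.
    static MemoryRule doNotRelease() {
        return new MemoryRule() {
            @Override
            public void onReturn() {
                // Intentionally empty: returns are not counted.
            }

            @Override
            public boolean canRelease() {
                return false;
            }
        };
    }
}
```

Under this framing, release-always is the special case of release-when-ready with a threshold of one.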
Each memory rule has its advantages and disadvantages. Release-always processes the matrix multiplication sequentially and stays within memory limits. Release-when-ready reuses A, but requires re-reading B n times. Do-not-release maximizes the number of items being processed in the pipeline, but requires the matrix to be in-core. The memory rule chosen for out-of-core matrix multiplication is Release-when-ready, as it features the pipelining of Do-not-release while staying within memory limits. This workflow system can be extended to use GPUs by representing MatMul as a CUDA Task, preceded by a data-motion task.
To scale out-of-core matrix multiplication across multiple GPUs, the task graph from Figure 3 is added to an execution pipeline. Given n pipelines for n GPUs with 1 pipeline per GPU (each pipeline bound to a separate GPU), Matrix A is decomposed into n equal-sized pieces. The execution pipeline task will duplicate the task graph and process each sub-matrix of A independently.
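The decomposition of A across pipelines can be sketched as a contiguous partition of slice indices. The class `PipelinePartition` and its `range` method are hypothetical; the text describes equal-sized pieces, and the handling of a remainder (giving the first pipelines one extra slice) is an assumption added here for the case where the slice count is not a multiple of n.

```java
// Hypothetical sketch: assigns a contiguous range of A slices to each pipeline.
final class PipelinePartition {
    // Returns {startInclusive, endExclusive} for pipeline p of `pipelines`.
    static int[] range(int slices, int pipelines, int p) {
        int base = slices / pipelines;   // slices every pipeline receives
        int extra = slices % pipelines;  // leftover slices, one each to the first pipelines
        int start = p * base + Math.min(p, extra);
        int end = start + base + (p < extra ? 1 : 0);
        return new int[] {start, end};
    }
}
```

For example, `PipelinePartition.range(9, 3, 1)` covers slices 3 through 5, one third of a nine-slice matrix.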
IV. PROTOTYPE - HYBRID MICROSCOPY IMAGE STITCHING
As shown in Figure 4, hybrid image stitching consists of eight tasks. The eight tasks are listed below:
1) mm: memory manager that generates CUDA FFT memory
2) read: reads an image
3) copy: copies an image to GPU memory from the memory manager, thereby hiding the cost of data motion by executing concurrently on the GPU
4) FFT: computes the forward fast Fourier transform on the GPU
5) bk1: identifies when two neighboring images have their FFTs computed
6) pciam: computes the phase correlation image alignment method between two neighboring tiles on the GPU
7) bk2: forwards CUDA memory back to the memory manager and forwards the output from pciam to the next task
8) ccf: computes the cross correlation factors on the CPU
There is one dependency that requires the FFTs of two neighboring tiles to be computed before processing the PCIAM function. When an image's FFT is available, the FFT can be used in computations with its northern, southern, eastern, and western neighbors. To avoid unnecessary FFT computations, the memory manager uses a reference count to keep FFTs in memory. The reference count refers to the number of times an image's FFT is used with its four neighbors (three for boundary cases, and two for the corners).
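The reference count described above follows directly from a tile's position in the grid. The sketch below is illustrative: the class `FftReferenceCount` and its methods are hypothetical names, and the grid is assumed to have at least two rows and two columns so that the 4/3/2 pattern holds.

```java
// Hypothetical sketch: reference count of a tile's FFT in a rows x cols grid.
final class FftReferenceCount {
    // Number of north, south, west, and east neighbors of tile (row, col).
    static int neighbors(int row, int col, int rows, int cols) {
        int count = 0;
        if (row > 0) count++;        // northern neighbor exists
        if (row < rows - 1) count++; // southern neighbor exists
        if (col > 0) count++;        // western neighbor exists
        if (col < cols - 1) count++; // eastern neighbor exists
        return count;
    }

    // Number of neighboring tile pairs, i.e., PCIAM computations in the grid.
    static int totalPairs(int rows, int cols) {
        return rows * (cols - 1) + cols * (rows - 1);
    }
}
```

An FFT buffer can return to the memory manager's pool once it has been used this many times, which bounds how long each FFT stays resident on the GPU.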
The task graph in Figure 4 will execute on one GPU only. To scale to multiple GPUs, the graph is added to an execution pipeline as shown in Figure 5 , which is then instantiated once per GPU. Each task graph inside the execution pipeline is duplicated and the image tile grid is decomposed evenly for each inner task graph. The ccf task remains outside of the execution pipeline and processes CCFs using a pool of CPU threads.
V. RESULTS
Table I compares our novel HTGS-based implementation of hybrid microscopy image stitching with the implementation without HTGS [6]. Each test case is repeated 50 times using a grid of 42 × 59 images (6.6 GB) and the average end-to-end run-time is reported. The machine used has two Intel Xeon E5620 CPUs (16 logical cores), two NVIDIA Tesla C2070 GPUs, and one NVIDIA GTX 680 GPU. The implementation is written in Java and uses the JCuda [13] and JCuFFT [14] libraries. Table I shows that using HTGS without execution pipelines reduces the code size by 23.6% compared to the original hybrid workflow [6]. Including the execution pipeline enables the hybrid workflow to scale to multiple GPUs and obtains a performance improvement of 1.7x with three GPUs at the cost of one additional line of code. Going from two to three GPUs shows little performance improvement because the maximum disk bandwidth is reached.
VI. CONCLUSION
This work represents an early prototype of the hybrid task graph scheduler. Comparing the original implementation [6] with HTGS using 3 GPUs shows a 23.6% reduction in code size and a speedup of 17% (24.5 s versus 29.8 s). Hybrid workflows are effective at parallelizing an algorithm, hiding data motion, and keeping processors busy. HTGS reduces the effort required to represent hybrid workflows in image stitching, while maintaining the performance of a manually created hybrid workflow. HTGS also provides a framework for representing algorithms and tools for complex, data-intensive applications that require very high performance.
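As a quick check on the arithmetic, the two run-times quoted above can be related either as a ratio or as a relative reduction. The only assumption here is that 29.8 s is the slower run and 24.5 s the faster one:

```latex
\[
\frac{29.8\ \mathrm{s}}{24.5\ \mathrm{s}} \approx 1.22,
\qquad
\frac{29.8\ \mathrm{s} - 24.5\ \mathrm{s}}{29.8\ \mathrm{s}} \approx 0.178 .
\]
```

The first quantity is the speedup factor and the second is the fraction of run-time saved.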
VII. FUTURE WORK
HTGS is implemented in Java. Using native bindings, Java can perform exceptionally well on hardware, but providing the native bindings is an extra step and can cause extra overhead going through the Java virtual machine. To resolve this, we are in the process of porting HTGS to C++.
Execution pipelines are an excellent way to scale task graphs to multiple GPUs. This method can be applied to clusters and hybrid clusters. Using an MPI execution pipeline, task graphs could scale to clusters. Given that an execution pipeline in the HTGS methodology is just a task in the task graph, each execution pipeline could contain one or more additional execution pipelines within itself. This recursive nature enables execution pipelines to effectively map to hybrid clusters. For example: one execution pipeline per GPU, per node in a cluster, or a combination of both.
We plan to release a version of HTGS with its source code in the near future.
DISCLAIMER
No approval or endorsement of any commercial product by NIST is intended or implied. Certain commercial software, products, and systems are identified in this report to facilitate better understanding. Such identification does not imply recommendations or endorsement by NIST, nor does it imply that the software and products identified are necessarily the best available for the purpose.
