Widespread heterogeneous parallelism is unavoidable given the emergence of GeneralPurpose computing on graphics processing units (GPGPU). The characteristics of a Graphics Processing Unit (GPU)-including significant memory transfer latency and complex performance characteristics-demand new approaches to ensuring that all available computational resources are efficiently utilised. This paper considers the simple case of a divisible workload based on widely-used numerical linear algebra routines and the challenges that prevent efficient use of all resources available to a naive SPMD application using the GPU as an accelerator. We suggest a possible queue monitoring strategy that facilitates resource usage with a view to balancing the CPU/GPU utilisation for applications that fit the pipeline parallel architectural pattern on heterogeneous multicore/multi-node CPU and GPU systems. We propose a stochastic allocation technique that may serve as a foundation for heuristic approaches to balancing CPU/GPU workloads.
Introduction
General-Purpose computing on Graphics Processing Units (GPGPU) introduces significant complexity into parallel CPU systems. Specifically, communication costs, latency, and non-linear efficiency characteristics are strongly input-dependent. In both single and multi-GPU configurations, one of several challenges is maintaining maximum utilisation of all available resources given the constraints of disjoint memory address spaces. In recognition of this problem, AMD and NVIDIA, the dominant players in the GPU arena, are moving towards integrated on-chip CPU/GPU systems-termed Accelerated Processing Units (APUs)-as part of their Projects Fusion and Denver respectively. APUs have started to demonstrate significant peak performance improvements for certain fine-tuned applications [6] .
However, the problem of coordinating CPU and GPU execution resources will arguably persist as there are significant barriers to overcome. In an efficient software implementation, CPU/GPU communication may be overlapped with computation at both ends to enable latency hiding while maintaining load balancing. However, it is not a trivial task to produce self-balancing code given the variability in relative performance of computational kernels across a wide range of low to high end CPU and GPU models. GPUs, being massively parallel processors, may require a large number of simultaneous inputs to minimise resource idling. Furthermore, any algorithmic improvements or new hardware that lead to a higher-performance CPU/GPU implementation may disrupt the delicate balance of an existing finely tuned application. Arguably, it is natural for GPU implementations to lag the latest algorithms implemented on CPUs for a broad range of applications.
Our research question is:
how do we ensure that both CPU and GPU resources are fully utilised in a (pipeline) approach to tasks that is cooperative rather than mutually exclusive ? In this paper, we propose a high-level heuristic strategy for achieving efficient performance in applications that conform to the pipeline parallel pattern. As the GPU computation model is based on the transfer of work units into and out of GPU memory, it can be naturally implemented as a pipeline as shown in Figure 1 . The aim of our approach is to ensure that the multi-core CPU and GPU resources available to an application are used to the maximum extent possible. We expect that with suitable refinements, this technique may be extended to generalised dataflow graphs and other architectural parallel patterns, a widely-shared research stance [2, 14, 16] .
The need to adopt a CPU/GPU balancing strategy has emerged practically during our work in an actual scientific modelling application [10] . Here architectural limitations have made it difficult to integrate our GPU kernel implementations of performance-bound numerical linear algebra routines/libraries with the existing program in an efficient manner. It may be argued that the deficiencies of the GPUas-accelerator model are indicators of incompatibility with traditional techniques of structuring parallel programs. We view this as an opportunity to identify and deploy a generic parallel pipeline pattern that may be applicable to similar CPU/GPU balancing problems.
Background
Eigenvector and eigenvalue determination are recurrent problems in computational modelling applications. As increasingly elaborate models become of practical value to scientific and engineering applications, the demands of solving larger eigensystems typically require the deployment of high performance numerical linear algebra libraries executing on large heterogeneous high-performance clusters or supercomputers. Eispack, and its successor Lapack, provide extensive linear algebra routines in mathematical and scientific computing. Originally developed in the US in the seventies [17] , the accuracy and numerical stability of Eispack has been established through diverse application over the past 30+ years, leading to a number of developments in the field [7] .
Within the context of our application, Hermitian eigensystems arise from the simulation of inelastic neutron scattering (INS) from polycrystalline materials, where a typical model requires eigendecomposition of several million matrices. As the complexity of a crystalline structure increases, the computational cost of solving these eigensystems grows rapidly with the standard routines of Eispack and Lapack which are of order O(n 3 ) for matrices of order n. Since the emergence of this technique, the demands of materials researchers in this domain have grown by several orders of magnitude. This has already exceeded the capability of all but the largest parallel machines for the current generation of significative models, driving new interest in alternative hybrid CPU/GPU computing models [5, 12] .
These eigensystem computations lend themselves to deployment as a divisible load as the solution of distinct systems is largely independent. Our earlier work in this field has involved an MPI-based parallel implementation that demonstrates near-ideal linear scaling behaviour on both multi-core and multi-node configurations [10] , and an ad-hoc GPU implementation of select Eispack routines [11] .
Different research groups have consistently reported similar improvements on manually-tuned linear algebra routines [4, 9] . Admittedly, the more modern Lapack-which has largely superseded Eispack-may have formed a functionally superior basis. However, the inherent architectural complexity and reliance on an efficient Blas implementation require significant resource commitment in a long-term effort. The Magma library is in the early stages of providing hybrid multicore-CPU/GPU implementations of Lapack routines [19] . Our implementation complements Magma for smaller problem sizes and is particularly tuned to the matrix dimensions in our current models. However, in the long term, Magma is the natural choice at larger matrix sizes.
Limitations of the GPU Accelerator Model
While the actual application is algorithmically complex and features a deep call hierarchy, a simplified high-level view of SPMD execution with cyclic decomposition over k identical processors is presented in Figure 2 .1. A GPU implementation exists for code section B, the most computationally expensive stage of the loop, blocks A and C are exclusively executed on the CPUs. As the iterations themselves are independent, this problem may be considered as a divisible workload with a GPU payload. Empirical evidence demonstrates that our GPU eigensolver implementations have performed, in isolation, up to two orders of magnitude faster than the equivalent CPU implementation. However the challenge arises when an attempt is made to integrate the GPU kernels with the original CPU code. Adapting this SPMD implementation to heterogeneous environments and the conventional GPU acceleration model exposes a number of limitations: (1) A, B and C finish at different rates as the performance characteristics of the processors are not identical. (2) The makespan is determined by the worst-case completion time of the slowest processor. (3) GPUs require a minimum amount of work for peak efficiency to allow the use of all available streaming multiprocessors, permit latency hiding and to offset the cost of host-device memory transfer overhead. While the limited host memory available to the application may only hold k processes, k instances of B will usually not be sufficient for peak GPU efficiency. (4) GPUs are typically idle during A and C; CPUs are typically idle during B. (5) The actual, relative performance of the CPUs and GPUs may only be determined at runtime as they vary significantly with the characteristics of the particular models being simulated. Algorithms must be typically vectorisable and the available data parallelism depends on problem size. (6) The presence of multiple GPUs in a given architecture adds complexity to the overall problem. . Individual and total Execution time per 720-order matrix at various CUDA grid sizes for htridi, tql2, and htribk, the main three Eispack kernels that perform Householder reduction to tridiagonal form, eigenvectors and eigenvalues by QR iteration and back-transformation respectively. Each block in the grid maps to a single eigensystem.
The GPU can deliver a performance advantage only when there are a large number of input matrices to keep all streaming multiprocessors active (see Figure 3) . For smaller models, this number may be several hundreds. Nevertheless, the memory footprint of this application is large and there are a limited number of processes that can be accommodated in CPU memory, even when the performance penalty of inefficient virtual memory is neglected. In the current MPI-based SPMD implementation, the application is unable to generate sufficient GPU workload within these memory constraints. Furthermore, the generation of the dynamical matrix is itself a non-trivial computation that cannot proceed. As latency hiding cannot be accomplished by overlapping communication and computation, both sets of resources are forced to be idle as CPU/GPU execution becomes mutually exclusive.
The core issues are that ı) the cost of the memory transfer to and from the GPU is significant and ıı) both CPU and GPU resources are alternately idle while the computation is in progress.
While maintaining the GPU code performance at two orders of magnitude above the CPU code, we have upgraded our original INS application to Lapack. Evaluations of the algorithmic variants for Hermitian eigensystems in Lapack demonstrate model-dependent performance characteristics. With the benefit of these new routines, the CPU code performs at one order of magnitude faster. Given alternative execution paths, where performance depends on the characteristics of the Hermitian dynamical matrix corresponding to the model, tuning the code to achieve workload balancing is largely manual. Different approaches have been suggested in the literature to achieve this that include CPU-GPU call-back strategies to streamline the data exchange [18] and the heuristic selection of automatically-generated GPU code deployments [8] . We therefore seek a way to efficiently balance the utilisation of both resources dynamically in an automatic manner.
As a practical problem, our objective is not to find a theoretically-optimal solution at potentially significant runtime expense, but one that provides reasonable resource usage that is acceptably close to the peak obtainable performance. Previous work in the field suggests that a parallel pipeline pattern [1, 13] is a viable approach to minimising CPU and GPU idle times. It is possible to expose a highlevel application-independent interface to the application programmer that takes advantage of the intrinsic pipeline characteristics of the GPU hardware [15] . While it has been demonstrated that a polynomial time solution exists for identical processors in a linear pipeline, introducing heterogeneity creates an NP-hard allocation problem [3] .
Given alternate paths of execution representing differing but equivalent implementations with a range of performance characteristics that depend on both inputs and target hardware with varying memory requirements, how do we make the best choice when this information is only available at runtime? We contend that a heuristic approach based on a queue monitoring strategy should be able to accommodate these requirements. 
Queue Stability and Peak Utilisation
To recast this problem as a pipeline, we have decoupled the three stages of (A) matrix generation, (B) matrix diagonalisation and (C) determination of scattering contributions. Of these, the diagonalisation of the dynamical matrix is usually but not necessarily always the dominant aspect of the computation.
The advantage of a loosely-coupled asynchronous pipeline is that the memory constraints of the SPMD approach can be avoided. Each work unit flowing through the pipeline corresponds to a unique triple (Q, θ, φ), which can be processed independently at every stage. It is noted that an intermediate buffer or queue is necessary between pipeline stages.
Representing this application as a three-stage pipeline and considering the queue levels makes the efficiency problem readily expressible. Stages I and III of the pipeline are strictly CPU-only stages as GPU implementations are unavailable. As Stage II is the computationally dominant stage of the pipeline and has both CPU and GPU implementations, only the GPU implementation is in use as it offers higher performance. For stages I and II, there are two expected scenarios depending on whether the CPU or GPU constitutes the bottleneck in the pipeline ( Figure 5 ). (1) Producer Deficit: Stage I constitutes the bottleneck to the pipeline if and only if it is unable to provide and process data for the stage II GPU at a sufficient rate. This may arguably be the result of performance or memory limitations on the host machine and is illustrated in Figure 5 (a) for a simple simulation. In this case, it may be necessary to use external resources to increase output from this stage for better GPU utilisation. Adding external CPUs will eventually lead to the next scenario. (2) Consumer Deficit: If Stage II constitutes the computational bottleneck in the application pipeline, the CPUs at stage I will complete their work units early after filling the output queue and will wait for the GPU to complete the current batch of work at stage II. As presented in Figure 5 (b), there is noticeable resource waste at the idle stages of the pipeline caused by a pattern of intermittent CPU activity. In a basic allocation scheme, the application will be in one of these two inefficient states for which the buffer levels are clear indicators of the presence and nature of the bottleneck. The overall throughput of the pipeline is that of the slowest individual stage.
We can eliminate the extreme cases that lead to buffer overruns and underruns by dynamically reallocating the CPU at stage I to other stages of the pipeline (Stage II) when necessary. This approach can be reduced to two simple rules local to each process:
(1) Keep all queue levels q i above a set lower bound q L to ensure the availability of work for the inputs to succeeding pipeline stages i.e. q i > q L . (2) Keep all queue levels q i below a set upper bound q U to ensure the availability of space for the outputs of preceding pipeline stages i.e. q i < q U . Figure 6 shows that this basic strategy ensures that all execution resources are fully utilised. Queue stability occurs when the rates of production and consumption are matched and buffer overruns and underruns are avoided. This is only possible when the buffer levels are fairly stable.
In order to visualise the behaviour of our heuristic approach, let us consider the state of the system queues to be a moving 'particle' in an n−dimensional state, or queue space. Then, it is possible to represent the individual throughput at each stage T A , T B and T C as 'velocity vectors' for which:
• The n components of the state vector are the n queue levels • The velocity vector magnitudes are the magnitudes of the throughput T i • The velocity vector orientations follow from the dataflow topology i.e stages between queues (e.g. T B in Figure 7 ) have a negative component on the incoming queue and a positive component on the outgoing queue state vector components • The system trajectory is in the direction of the resultant velocity vector Figure 7 illustrates that the effect of the heuristic is to keep the system within the bounding box.
Since dynamic reallocation requires that queues are accessible from remote processes in a distributed environment, distributed queue processes and a submission and retrieval protocol based on message passing are necessary. Coupled with end-toend tracking of issued and completed iteration indices at the pipeline endpoints, this loosely coupled architectural approach facilitates the implementation of fault tolerance. In this model, it is possible to isolate local process or node failures and achieve application state 'checkpointing' by saving enqueued data to persistent storage.
Stochastic Allocation
We seek an implementation that not only uses idle CPU resources to assist overworked GPU stages and prevent queue instability but is sufficiently flexible to facilitate the adoption of more complex heuristic strategies for pipeline optimisation. As dynamic reallocation on distributed queues effectively represents the application as operations (stages) between queue pairs, we propose a dynamic stochastic allocation scheme in which processes are allocated to multiple pipeline stages and have a probability p ij (t) of executing pipeline stage i at any time t.
Let t ij represent the number of work units processed per time unit or throughput for pipeline stage i when executing exclusively on processor j and let p ij represent the probability that processor j will execute pipeline stage i at any time. Then the following assertions hold:
• The sum of all probabilities for processor j over all pipeline stages is one
• For queue Q i connecting stages i and i + 1 of the pipeline, the rate of change of the queue level q i is given by
• If the pipeline is stable, the average queue levels are constant, therefore the rate of change of queue levels is zero It may be concluded from Equation (3) that for stable queues without supply or consumption deficits, T i = T i+1 for all stages of the pipeline. Furthermore, there is an effective throughput T that is constant for all stages. With probabilistic allocation, the likelihood of executing the bottleneck stage i may be raised for all or a subset of processes, thereby redirecting computational resources away from pipeline stages with idle capacity and increasing the throughput of the entire pipeline.
Effective throughput for stochastic allocation
Consider a time duration, d, with N total work elements processed. Then for probability of choice p,
Dividing by duration, we obtain an expression in terms of effective throughput
However, the duration of each individual computation is 1 /Ti and the total time spent is
Dividing by d results in an expression in terms of effective throughput
From the simultaneous solution of Equations 4 and 5, we obtain
Stochastic allocation should therefore allow control of process assignment to pipeline stages with greater refinement than is possible with oversubscribed heavyweight threads or processes managed by the operating system scheduler, whilst avoiding the associated context switching overhead. Stochastic allocation allows the pipeline to potentially be realisable on anything from a single CPU to a very large number of mixed CPUs and GPUs in a distributed environment.
Our approach transforms the traditional processor allocation problem from a discrete to a continuous optimisation problem, allowing the exploration of a broader class of heuristic techniques. A useful analogy from quantum mechanics is that the system state may be regarded as a superposition of allocations.
It is therefore possible to maximise resource utilisation by adjusting the stochastic allocation p ij such that the effective throughput at each pipeline stage are uniform. From equation 2, any imbalance between pipeline stages is readily indicated by the average rate of change of the connecting queue.
In practical terms, it is possible to monitor the rate of change of queue Q i to identify potential bottlenecks and unused capacity that are indicative of variations between consecutive pipeline stage throughput T i and T i+1 and adjust these values by varying p ij and p i+1j in real time.
One advantage of this approach is that adjustments to p ij can be performed locally by each processor on the basis of existing queue levels without centralised control by a scheduler. Work distribution then becomes an emergent behaviour of the entire application. Furthermore, performance variations from sudden fluctuations in resource availability or performance will cause detectable variations in queue levels to which the pipeline may quickly adapt, before overall performance is significantly affected.
Concluding Remarks
In this paper we have presented a generic methodological approach to balance the CPU/GPU workload by adaptive queue monitoring and dynamic reallocation in a generic parallel pipeline pattern. Such patterns may arguably be generalised into multi-stage pipelines with other nested patterns at every stage (e.g. a farm) to allow the inclusion of multiple hybrid computing elements without the need to manually tune the application for all possible architectures.
It is important to mention that, in early testing, our heuristic approach has demonstrated dynamic adaptability, not only to avoid the bottlenecks of the CPU/GPU pipeline and improve resource utilisation, but also to provide decoupled execution that supports dynamic resource provisioning and fault tolerance in a distributed application.
The simple three stage pipeline is a compositional unit of more complex dataflow graphs. Our future work intends to apply performance modelling and queueing theory to analytically model the behaviour of these larger and more complex networks of heterogeneous elements that conform to widely deployed parallel constructs. Determination of the entries of the stochastic allocation matrix P is a continuous optimisation problem and further work is required to explore and characterise the nature of the solution spaces and the heuristic algorithms best suited to their exploration.
Additionally, we plan to pragmatically evaluate these approaches using a large test-bed based on numerical linear algebra routines immersed into our computational nanoscience code. The aim is to fully integrate fault tolerance and dynamic provisioning of computational resources at runtime with transparent efficiency maximisation.
