ABSTRACT To achieve high computation throughput, heterogeneous architectures utilize many specialpurpose cores to work as floating point computing coprocessors. Popular programming models typically offload computing intensive operations to coprocessors and then aggregate the results. This approach results in the need of transferring a large amount of data via the peripheral component interconnect express (PCIe). To leverage the limited bandwidth of PCIe, we develop a reverse offload (rOffload) model that treats the autonomous Intel Many Integrated Core (MIC) coprocessor as the host processor while the CPU is treated as the coprocessor. The MICs orchestrate the computation and offload work, which cannot be accelerated on MIC, to the CPUs, thus reducing the overhead introduced by moving data among distinct memory regions. In this paper, we present an overview of rOffload, including the basic programming interface and its implementation on a CPU-MIC system. The results from benchmarking and from application experiments conducted on the Tianhe-2 supercomputer demonstrate the efficiency of our rOffload model in terms of programmability, portability, and performance.
I. INTRODUCTION
Massively heterogeneous architectures that use multi-core CPUs and many-core coprocessors have supported the attainment of petascale systems, with future planning looking toward exascale systems [1] , [2] . Multi-core CPU increases instruction-set architecture (ISA) parallelism, while manycore coprocessors are concerned with the strength of parallel throughput [3] . Both together contribute to achieving better performance than CPU-only systems, and there are increasingly more coprocessor-based supercomputers appearing in the Top500 [4] lists.
The primary challenge imposed by heterogeneous systems is programmability [5] - [8] , which provides the bridge between an application and the implementation of that application on available hardware [9] - [11] . With the introduction of the Intel MIC which acts as an autonomous compute node with its own IP address, we address the issue of programmability by developing a reverse-Offload (rOffload) programming model for CPU-MIC systems. In this model, the MICs orchestrate the computation and offload work to the generalpurpose processors. We offer the following contributions:
• Instead of using MIC for compute-intensive parts of applications, we use CPU as accelerator for irregular codes. We develop rOffload, a framework that provides the means to offload computations from MIC to a host CPU through a unified API. To better handle the case of MPI communication, we also provide a MPI proxy on the MIC that is completely transparent to programmers.
• Building on a Many Integrated Core Platform Software Stack (MPSS), we propose a compiler-independent runtime framework that enabled our offloading patterns. We use both Intel's Symmetric Communication InterFace (SCIF) and standard MPI as its back-end to enable the task offload and the introduced communication.
We also provide a hybrid DAG-based scheduling optimization to enrich the parallelism.
In general, the main memory of CPU in a typical heterogeneous system is larger than the device memory of coprocessor [12] . To solve problems involving very large data sets, the data volume may be too large to fit into the in-card memory of the coprocessors. We solve this problem for our proposed approach by reorganizing inter-node task partitioning. So that we could aggregate memory among the MIC processors, then allow the work-set to be stored on the coprocessors exclusively. In future systems, we expect that processor memory will be larger, at which point our proposed approach will not need this solution, and will appear to be more attractive.
The rest of this paper is organized as follows. Section II explains our motivation. In Section III, we introduce the rOffload model. Section IV describes a prototype implementation of rOffload, including the framework and software architecture. Section V presents an evaluation of performance. Section VI discusses related work, and Section VII provides our conclusions.
II. MOTIVATION
The typical heterogeneous system is intra-node heterogeneous (with PCIe connecting CPUs and coprocessors), and the inter-node is often homogeneous. MPI+X is promising with regard to these systems, where MPI is the de facto standard for inter-node programming, and X is coprocessorspecific for intra-node programming [13] . Generally, the CPU cores target general-purpose workloads that are characterized by complex control flow, large working-set sizes, irregular memory access patterns, relatively light floating-point activity, and limited parallelism. Meanwhile, the coprocessor cores target more specialized workloads that exhibit relatively infrequent branching, small working-set sizes, regular memory access patterns, heavy floating-point activity, and abundant parallelism. Thus, the state-of-the-art X models always treat the general-purpose cores as main processors and the special-purpose cores as coprocessors whose role is to speed up pieces of the application [14] . The approach of this model is to locate the hotspot of an application, offload it to an accelerator, and aggregate the result. The programmer may rewrite the code for an industry standard architecture e.g., CUDA for GPU. There are also pragma-based APIs, such as OpenACC, LEO, and OpenMP4.
For most offloaded kernels, data must be moved to the device prior to being used by the kernel, but it is more important to recognize that calculations performed on data are only worthwhile when the results of those calculations can be utilized further. Thus, data would be transferred into and out of the coprocessor for heterogeneous applications. Table 1 lists the memory and operation complexity of benchmarked kernels (the ''Seven Dwarfs'' from Berkeley [15] but MG), and the relative amount of data that each kernel transfers to the coprocessor's on-device memory and then back to the CPU's main memory after the kernel completes.
Most of the kernels need to transfer large amounts of data in-and-out/in-or-out of the coprocessor, with the exception of Monte Carlo, which needs only a small amount of data transfer. Since the memory of a coprocessor is separate from the host's main memory (unified memory architecture such as APU from AMD is not within the scope of this paper), it is necessary to communicate input/output data via the PCIe bus that provides the motherboard interconnect between the CPU and the coprocessor. All memory transfers between the CPU main memory and the coprocessor main memory flow over the PCIe connection. In many cases, data transfer over the PCIe connection contributes the majority of the time for the limitation of bandwidth. Reporting performance comparisons of heterogeneous applications falls short because reporting does not describe adequately the memory-transfer overhead incurred by applications that run on a GPU in typical use cases [16] . It is possible to stream data asynchronously using PCIe while data is being acted upon on the coprocessors, but this approach may harm the modularity, depends on the ratio of computation and data transfer overhead, and relies enormously on hand-tuned code. Thus, the system performance may be limited by the bandwidth of PCIe transfer.
With the introduction of Intel MIC architecture, an alternative to GPUs in the field of coprocessor-enhanced computing became available. MIC architecture runs a fully user-accessible Linux OS together with other operational software and services provided by the Intel MPSS. Unlike GPU, MIC can act as an autonomous compute node with its own IP address. In particular, MIC can execute programs natively without involving the host, which enables it to participate directly in parallel program executions using MPI. Thus, the data are near the computing, and the overhead incurred by transferring data is avoided. However, the MIC cores feature lower frequency, and they are not fit for irregular computing. To overcome these limitations, we introduced the rOffload model to offload work from the MIC towards the CPU. In addition, we implemented an MPI library to aggregate the MIC cards of the inter-nodes.
III. THE ROFFLOAD MODEL
In this section, we introduce rOffload, including its execution model and the proposed API that serves as its front-end.
A. EXECUTION MODEL
The execution model is based on the hardware configuration of heterogeneous systems. Here we take Tianhe-2 as an example. The Tianhe-2 heterogeneous computing node contains two general CPUs and three MICs. The CPUs share the same memory space while each MIC has its own separate memory. PCIe works as a bridge which joins the multi core CPUs, MIC, MPI communication components (IB card), and other parts. The role of the CPU and MIC can be flexible. Conventionally, the The special MICs are used as floating point computing accelerators, and the general CPUs are used as host processors. The MPI processors run on the host CPUs, while the offload is directed towards the MICs. Turning the current Offload model upside-down, we treat a hybrid system as a cluster of high-speed, special-purpose cores, each with an attached general-purpose core for offloading control, memory, and input/output (I/O) intensive work. MPI processes run on the MIC only. The main procedure is launched on the MIC, and the offload is directed to the host Xeon CPU. As the CPU features higher ISA parallelism and larger cache size, the MIC would send control-intensive work to the CPU cores, and aggregate the results. Figure 1 illustrates how a programmer views the system in the Offload model and in our rOffload model. In the Offload model, the CPU manages the computation, sending compute-intensive work to the MICs and aggregating the results. In the reverse-Offload model, the MIC (or Xeon Phi) manages the computation, sending control intensive work to the CPU and aggregating the results. Existing MPI implementations executed on MIC processors can exchange data directly with other MPI processes running on a remote host or remote MIC processors. Still, observations demonstrated that the communication had fatal defects on performance in terms of bandwidth and latency, resulting in very poor channel utilization [17] . In addition to control-intensive computing offloading, we leveraged the CPU core for MPI communication by providing an MPI proxy on the CPU that could be invoked by MIC calls.
Compared to the existing Offload model, our proposed rOffload approach sends control-and communication-intensive parts to the CPU, which alleviates pressure on the PCIe, and generates less data transfer latency. Thus, we realized zero data loading for launching the MICs, and achieved datanear computing. At the same time, we already know that for higher efficiency, it is better to keep busy the cores that have higher compute rates than those with lower compute rates. This model made the program start on accelerator, abundantly express the massive threads computing superiority of accelerator. While the CPU runs according to the accelerator, it can give more opportunity for the accelerator to run fully with less waiting. This approach keeps the system in high performance mode, and reduces power dissipation through idle CPU time.
B. API
In this section, we describe the rOffload API that enables programmers to annotate code in a style that is conceptually similar to OpenCL which is available as library calls. This arrangement makes it easy to port existing code, and allows for a direct performance comparison with forward offloading. As discussed, the control-intensive computing and MPI communication tasks can be offloaded directly to the CPU, thus providing two kinds of API: a task offloading layer and a message passing layer.
1) TASK OFFLOADING LAYER
The task offloading layer supports a Remote Procedure Call (RPC) mechanism for invoking functions on the generalpurpose processor, and receives the results. The rOffload provides the functionality of the reverse offload API as a front-end. Table 2 presents an overview of the main API elements, which fall into four categories:
1) Task management is used to offload the tasks from the MIC to the CPU. We accomplish that offload by using the API of RunFuncAsync and RunFuncSync, which enqueue a function in the remote process binary to be executed. To enable a convenient schedule, the programmer could set the input dependencies. A pointer to a RunEvent object signals completed execution of the asynchronous function. Then, the API of RunWait waits for an arbitrary number of RunEvent to be signaled as completed. 2) Data management is defined by BufferCopy to specify data movement between the CPU and the MIC. We provide asynchronous transfer to make up the PCIe transfer limitation. If the completion event is passed, then the write is asynchronous. In addition, if more than one DMA channel is enabled, operations to the same buffer can happen in parallel if they can be assigned to different DMA hardware. 3) Thread-and data-level parallelism are leveraged using a third-party component, e.g. OpenMP, for the X86 compatibility of the MIC ISA. Up to this point, we have kept rOffload simple, including only the core features essential to demonstrate its effectiveness. Once we have gained more experience with rOffload, we expect to enrich rOffload with more functions to turn it into a more productive programming model.
2) MESSAGE PASSING LAYER
We developed a software message passing layer that enables us to program large scale CPU-MIC systems using the rOffload model. In our model, MPI communications execute on the CPU. We enable the developer to employ MPI VOLUME 7, 2019 APIs directly on MIC through our message passing layer. Thus, we make it possible to inherit a large amount of legacy code, since MPI is the de facto standard for programming parallel computers and workstation clusters. Although the MPI interface defines hundreds of functions, it is not difficult to include all of them in a MIC library because most MPI programs utilize only a tiny fraction of the functions that MPI provides. The functions currently implemented are:
MPI_Send(), MPI_Wtime(), and MPI_Wtick(). Initially, our implementation goal is to provide the most related MPI functions, and incrementally introduce additional functions later as needed.
IV. IMPLEMENTATION OF THE rOffload PROTOTYPE
The rOffload system manages offload from MIC processes to several different CPUs, and aggregates the result. We used both Intel's SCIF and a standard MPI as its back-end to enable task offloading and communication. We also provided a hybrid DAG-based scheduling optimization to enrich the parallelism. half of the figure is the stack running on the coprocessor. The top half of the figure represents user space, while the bottom half represents kernel space. rOffload is user-level middleware, and it was designed to be lightweight and completely transparent to users. It interacts closely with both user processes and other kernel-level components. It controls offload scheduling and dispatch by intercepting SCIF API calls. This is a key mechanism in rOffload that enables us to gain transparent control of how offloads are managed. Every offload is converted into a series of SCIF calls, which are part of a standard API supported by Intel. These SCIF calls are used to move data between the host and coprocessor, and to launch offloaded kernels on the host. By intercepting these calls, rOffload controls how offloads are scheduled and dispatched transparently. Figure 3 shows the three main components of rOffload: the monitor that executes on the host and coprocessor, and the scheduler that executes only on the MIC. Each one works separately. These processes communicate using explicit messages.
B. SOFTWARE ARCHITECTURE

1) CARD-SIDE MONITOR
The card-side monitor collects data about the state of the MIC and the CPU; it also communicates with the scheduler. One of the most important functions of the card-side monitor is to intercept and redefine every rOffload API function. The redefined functions first perform rOffload-specific tasks, such as communicating with the scheduler, and then they invoke the original function on the CPU side. A common and efficient way to implement this concept is to include the code address of a handler function in the message. When a message is sent, the code at the embedded handler address will be executed on the CPU. The address of the message's payload is passed as an argument to its handler. Then on the card-side, which is an active endpoint, the message will be put forward to the CPU end. The resolving work will be done by the CPU-side monitor.
2) CPU-SIDE MONITOR
A key part of the CPU-side monitor is the client that intercepts function calls and communicates with the server from the MIC. The purpose of the mechanism is to gain transparent control of how rOffload is managed. To accomplish this end, we used a global while-switch construct as shown in Algorithm 1. Code that is to be offloaded with rOffload has to be encapsulated in a function, so that these functions can be executed upon receiving a particular message. This approach requires the program to be composed in codeblocks (functions f1(), . . . , fn()). The processor performs continuous checks to determine whether a message (with any tag) has arrived. Whenever a message arrives, the processor invokes the appropriate code-block, depending on the tag of the message. case tag 2 :
recv(tag 2, &m 2 );f 2 (m 2 );break; 8: ...
9:
case tag 1 :
recv(tag n, &m n );f n (m n );break; 11: default: 12: print(''Unknown task''); break; 13: end if 14: end while This module also monitors the load on each CPU, the health of each rOffload process (i.e., whether the process is alive or not). If a process dies, the monitor reports the reason for the problem for possible use by the user or administrator.
3) SCHEDULER
Heterogeneity makes design and implementation of portable scheduling a challenging issue. Taking asynchronicity as our guides, our goal is to obtain high performance by executing as many concurrent tasks as possible. Using our scheduling model, the sequence of tasks could then be orchestrated for an out-of-order execution.
The tasks in rOffload fall into the following three categories: A. CPU computing (including MPI communication): In the rOffload model, CPU cores are utilized to perform some part of the computations and are responsible for MPI communication. Here, we treat the MPI communications commands on the CPU as a kind of CPU computation.
B. MIC computing:
The operations that perform the computations that run on the MIC.
C. CPU-MIC communication:
For the separated memory of the CPU and MIC, this category consists of operations that are used to transfer updated values from MIC memory to CPU memory and vice versa.
rOffload uses three task queues to deliver each computing category running on the corresponding computation resource, as shown in Figure 4 . The task queues then store generated tasks in a single-linked list to keep the original sequential order between tasks. Each task consists of the information of a task's successor and predecessor. Based on the DAG information, when a task modifies its output, the runtime system can scan the lists to search for the tasks waiting for the output.
Two operations can be performed on a queue: task submission (using a push or insert method by the card-side monitor), and a request for task execution (using a pop method by the host-side monitor). After execution starts, all the tasks are inserted into the scheduler by calling the push method. When a task becomes ready (i.e., all dependencies are fulfilled, and all the predecessor tasks have been scheduled), the task will be popped for execution by calling the pop method. The push operation is invoked by the master thread. While generating a new task, the master thread inspects every input and output of the task. After a task completes and modifies its output tile, the pop operation, which is also invoked by the master thread, scans the list from the tasks just completed to find which other tasks are waiting for this task. The actual scheduling decision can be made at any time between the push and pop steps. We provide the same interface for all implementation strategies. This convenient framework allows experts to implement state-of-the-art scheduling strategies that can be used transparently within actual applications. Here we implement the first in, first out (FIFO) strategy. Since scheduling decisions are made only when tasks become ready, such scheduling strategies are naturally dynamic, even though it is possible to integrate static knowledge (user-provided or obtained at compile time) within the various methods of implementing the strategy.
V. EVALUATION A. EXPERIMENT SETUP
This section reports our evaluation of the rOffload framework. Our evaluation is divided in two parts. First, we demonstrate the efficiency of the rOffload implementation via micro benchmark test. Second, we test our proposed framework with two real applications. All experiments are conducted on the Tianhe-2 supercomputer. Each node of Tianhe-2 is composed of two Xeon E5 CPUs and three Intel Xeon MICs. The interconnect network topology is an opto-electronic hybrid hierarchical fat tree. The configuration details of the platform are listed in Table 3 B. MICRO BENCHMARKS
1) OVERHEADS OF rOffload
To quantify the overhead of rOffload, we carry out micro benchmarks and compare the cost for an empty offload of the different implementation. All measurements are repeated 10 5 times, and the average values are evaluated.
To directly compare rOffload and LEO, we investigated the offload on a single node. The results are shown in Figure 5 . The average time to invoke an rOffload kernel asynchronously from a MIC Xeon Phi processor to a CPU is 2.3µs. This time cost is measured by launching an empty kernel asynchronously without synchronizing. The average time to wait for a ready event is 0.12µs. The overhead increased to 14.1µs with synchronization between those empty kernels. Similar benchmarks for LEO on a MIC show an overhead of 2.6µs with no synchronization between kernels, and 20.5µs with synchronization between kernels. The average time to wait for a ready event is 1.2µs. These results show that the mechanism behind rOffload effectively controls the software overhead for remote code invocation, and provides better latency than LEO. Our implementation profits from use of the fundamental SCIF communication library which is the fastest and most low-level API for communication between CPU and MICs.
2) PERFORMANCE OF MPI COMMUNICATION AMONG MIC CARDS
In this section, we present a comparison of the internode and intra-node latency and between processes running on MICs. As shown in Figure 6 , for transferring short messages, implementation of MPI over IB verbs (directIB) offers the path with least latency, and thus produces the least latency. There is not much difference between the performance of directIB and rOffload implementation for the use of scif_send/recv, which introduces little latency during the transfer of short messages. As we move to larger messages, where the peer-to-peer bandwidth bottleneck come into play, our MPI design outperforms directIB. As shown in Figure 7 , our implementation improves the latency for 4 MByte messages by 51.5% when compared with the directIB-based design. Based on SCIF RMA transfer, our design gains slightly from the high performance channels it used, especially for large messages. Figures 8 and 9 show a comparison between directIB and our implementation in terms of inter-node/intra-node unidirectional MPI bandwidth. This version of directIB is limited by the peak intra-IOH peer-to-peer PCIe read bandwidth on the IvyBridge platform, which is around 1,000 million bytes/sec. Our implementation provide a 2× improvement over the bandwidth offered by Direct IB for 4 MB messages. We observe that the bandwidth performance grew slowly with the message size, because of the limited performance of the MIC-initiated DMA transfers. As the SCIF transfers are initiated by the proxy from the host, its bandwidth is also limited by peer-to-peer PCIe write bandwidth. Our work achieves the maximum uni-directional bandwidth performance among all the designs, but do not achieve close to peak bandwidth because of overhead involved by the RPC of MPI communication from MIC.
C. APPLICATION PERFORMANCE
To compare the performance of our proposed rOffload model with the Offload model, we test them both using code for High Performance Linpack (HPL) [18] and High Performance Conjugate Gradient (HPCG) [19] , [20] . For the comparison, we produce a hybrid version of the code for HPL and HPCG using the rOffload model and proposed API. First, we compare the amount of data transfer incurred for each. As a second test of the reverse-acceleration model, we examine how well our version of HPL and HPCG scale, and how its performance compared to that of the traditional hybrid (Offload) version of the program. HPL and HPCG are well-known performance metrics. HPL measures the performance of a computing system by solving a dense NxN system of linear equations of the form Ax = b. The solution is obtained by performing LU factorization of the coefficient matrix A with partial pivoting, and then solving the resulting triangular system of equations. HPCG utilizes the symmetric Gauss-Seidel preconditioned Conjugate Gradient algorithm, and solves a sparse linear system based on a simple additive Schwarz method. Designed to measure performance, HPCG represents many important calculations, such as finite volume, finite element, or finite difference discretization. It also provides coverage of the major communication and computational patterns, and rewards investment in local memory system performance. The Offload versions of HPL and HPCG are taken from MKL [21] , [22] and Yang [23] . And we develop the rOffload versions of HPL and HPCG with the proposed API and runtime.
1) MEASUREMENT OF DATA TRANSFER VIA PCIe
We compared the traditional Offload-HPL to our rOffload-HPL implementation for data transfer via PCIe. As rOffload-HPL is designed to be MIC-centric, the complete coefficient matrix is stored in the MIC's device memory. While using the traditional offloading GEMM methods, a.k.a. the CPU-centric approach, the work set could not fit in the MIC memory. The MIC could load only a subset of data from the CPU memory domain for GEMM [24] computation, and then write data to the CPU's memory after computing was done. This procedure would repeat several times for each iteration of factorization. These back-and-forth data transfers took place between the host and coprocessors via the PCIe bus. Figure 10 compares the amount of data transfer by Offload-HPL and by rOffload-HPL. The X axis of the graph is the size of the matrix (ranging from 1k to 48 K), and the Y axis is the amount of data transfer (log scale) via PCIe, because of the large range of the numerical values. As shown, our rOffload implementation significantly reduce the amount of data transfer incurred. Note that HPCG is designed to solve a spare matrix problem. We do not show the result, for the comparison on data transfer is not that remarkable,
2) PERFORMANCE COMPARISON
As a final test of the rOffload model, we examine scalability. Figures 11 and 12 plot the performance of HPL and HPCG as implemented with Offload and rOffload. Scalability is a key challenge for HPL and HPCG on any system. We scale the number of nodes to 64 for HPL and 8,192 for HPCG to further strengthen the results of the comparison testing. Both the Offload and rOffload versions of HPL and HPCG are executed in weak-scaling mode, i.e., we increase the problem size proportionally with each increase in node count. As Figure 12 indicates, our version scales smoothly all the way to node 8,192, which represents 24,576 MPI ranks. In single node and multi-node tests, rOffload-HPL outperforms Offload-HPL by 7% and 12% respectively. The result that rOffload-HPL scaled better than Offload-HPL benefitted from the following conditions. First, our implementation introduces fewer PCIe transfers, as noted earlier, for single node performance, saving the overhead for hiding PCIe latency. Second, we utilize the CPU cores for MPI communication only, and liberated them from heavy computing. Finally, our scheduling optimizations could hide the MPI communication. Meanwhile, rOffload-HPCG achieves a slightly worse result (about 5%) than Offload-HPCG for single node performance. It is reasonable because the Offload-HPCG takes CPU cores into the computing intensive parts, e.g. sparse matrix-vector multiplication. In addition, for this typical sparse problem our work on reducing PCIe transfers does not gain much. While scaling to the 128 node, rOffload-HPCG outperforms Offload-HPCG, and achieved 4.6% higher performance on 8,192 nodes. In other words, rOffload-HPCG scales better than Offload-HPCG. HPCG communicates using small messages and exhibits a very high ratio of communication to computation. Noted that, the rOffload-HPCG achieves an unstable result from 4 to 8 nodes. This is because the task partition of HPCG is to compute the factorization of the total number of processes into a 3-dimensional process grid that resembles a cube as closely as possible. While running on 4 and 8 nodes (12 and 24 processes), the computation can better hide the communication latency on the fat tree network architecture. Despite the fluctuating scores, the rOffload-HPCG always outperforms Offload-HPCG since from 128 nodes. As communication is always a dominated problem when running on distributed machines, our MPI implementation reveal advantages on latency and bandwidth.
VI. RELATED WORK
The Intel MIC coprocessor platform has a software stack, e.g. MPSS, that enables new programming models, including SCIF, COI, LEO, hStreams, MYO, and MPI.
The SCIF [25] is a low-level, socket-like API that provides a low-latency communication channel between clients that can be either the host or coprocessors. The efficiency of SCIF is the result of direct use of the PCIe bus for bi-directional data transfers between two coprocessors or between the host and a coprocessor. COI [26] provides a set of APIs to simplify development of applications using the offload model. COI includes APIs for launching device code, asynchronous execution, and data transfer between the host and the MIC. Based on shared virtual memory, Mine Yours and Ours (MYO) [27] provides a unified programming model for different platform configurations i.e., integrated, discrete, or hybrid means by which accelerators and CPUs connect to each other. Because of detached window addresses and different ISAs for CPUs and MICs, programmers need to annotate code that runs on the different cores. The shared datasets also need to be marked explicitly. Applying the FIFO streaming model to heterogeneous systems, Intel heteroStreams (hStream) [28] supports concurrency across nodes, among tasks within a node, and between data transfers and computation. Newburn et al. [29] give examples for different streaming approaches, and compare the features and versatility of hStreams, OpenMP, CUDA Streams, and OmpSs. Li et al. [30] systematically evaluate the performance benefits of using multiple streams. Language Extension Offload (LEO) is a pragma-based programming approach in which the CPU host controls the entire execution of a code [10] . Blocks of the code can be delegated to the coprocessors for execution. Because memory is not shared between the host and any of the coprocessors, variables and arrays needed in the offloaded code block also must be allocated on the target coprocessors. The content of the coprocessor data can be transferred back to the host if desired. Like LEO, OpenMP4 provides another set of directives that allow the programmer to mark code regions within a host program to be executed on the coprocessor. Since a host and coprocessor are physically separate compute devices, memory transfers between the two are necessary to provide data for computation(s) and get the results.
In addition, numerous compilers and productivity tools have been developed to minimize programming effort and optimize performance for a MIC system. Ravi et al. [31] propose a semi-automatic method for establishing the division of work between the CPU and MIC processors by providing new directives that added relaxed semantics to directive-based languages. Song et al. [32] present source-to-source compiler optimizations (Comp) that can improve the performance of applications that offload computations to manycore processors. Potluri et al. [33] investigate the communication within a node that consisted of a CPU host and one MIC coprocessor. They proposed MVAPICH-PRISM [17] , an MPI implementation that was a proxy-based communication framework using InfiniBand and SCIF for MIC clusters. Previous heterogeneous systems were equipped with more than one coprocessor. Dong et al. [34] present a hybrid programming strategy combining techniques such as MPI, OpenMP, COI, and SCIF to program multiple Xeon Phi coprocessors within one compute node. They enabled bi-directional and asynchronous data transfers between the coprocessors. Evaluation revealed that the low-level COI-SCIF approach resulted in lower communication overhead, in comparison with the MPI-based approach. Standard offload programming is restricted to a single compute node, and consequently to a limited number of coprocessors. Cadambi et al. [35] study the effects of multiprocessing on the MIC, and propose a novel and transparent middleware called COSMIC to alleviate the adverse effects of such multiprocessing. Noack et al. [36] provide the means to offload work to local and remote (co)processors using a unified offload API (Ham-Offload) by adopting the active messages, and combining the MPI and SCIF communication. As MIC processor can be programmed as the main processor and have direct access to the InfiniBand network adapter, Potluri et al. [37] redesign the MVAPICH2 MPI library for efficient intra-node and internode point-to-point communication on Xeon Phi clusters with InfiniBand. Mallón et al. [38] analyze the performance of different algorithms for broadcast, scatter, and gather in a large-scale Xeon Phi supercomputer. We can find that SCIF provides the backbone for the communications of CPU and MIC, and the upper level programming and runtime are developed based on SCIF. Like the related works above mentioned, our work is designed on SCIF and MPI as the backend. Figure 13 gives an overview of the mentioned programming models and the relationships among them.
In terms of programming centric on coprocessors, Pakin et al. [39] propose a reverse-acceleration hybridprogramming model. In this model, the accelerators orchestrate the computation, offloading work that cannot be accelerated to the general-purpose processors. Differendly, data is communicated among accelerators. Kistler et al. [40] store the coefficient matrix in Cell B/E of the Roadrunner (with both the host and the device prossessing a memory space of 16 GB) for Linpack. Thus, the frequency of communications between the CPU and coprocessor decreases greatly. Dongarra et al. [41] implement a resident LU on CPU-MIC system, e.g., MAGMA MIC library. Beckingsale et al. [42] present a GPU-resident adaptive mesh refinement (AMR) library. Our method shares the idea with these earlier efforts by reconstructing heterogeneous algorithms to reduce the data transfer overheads between the host and coprocessors. Our method shares the idea with these earlier efforts by programming centric on coprocessors.
VII. CONCLUSIONS
In this paper, we describe a reverse Offload (rOffload) model for heterogeneous systems to reduce data transfer, lighten the PCIe bottleneck, and strength performance. The rOffload model presents the opposite approach from the Offload model for programming a hybrid system, where a special-purpose core drives the computation and offloads control-, memory-, or I/O-intensive functions to the CPUs. We introduce a prototype implementation for rOffload, including the proposed programming API and a runtime system. The preliminary results from our experiments suggest that rOffload is promising in terms of programmability, portability, and performance.
We observe that during the development of large-scale parallel computing applications, reliability is another concern that need to be addressed in today supercomputing. This is particularly true for the heterogeneous systems, where a node often has complicated composition, e.g., CPU-MIC, which is prone to failure. Future work will include adding new components to our approach to provide fault-tolerant features. 
