The datacenter is becoming fully heterogeneous, integrating multiple OS-capable CPUs of different Instruction Set Architectures in separate machines. These machines present diverse performance and power consumption profiles, and we show that significant potential benefits for both metrics can be expected, should these machines be able to cooperate in the processing of multiprogrammed datacenter workloads. We advocate that this cooperation should be enabled at the level of the OS, relieving the programmer from any effort related to the heterogeneity of the managed machines. We propose a distributed OS architecture running on a fully heterogeneous computer cluster, enabling this cooperation through three main components: the abstraction of the entire cluster in a single system image, a distributed shared memory system, and a heterogeneous scheduler.
1 INTRODUCTION
Processing speed must keep increasing. However, this can no longer be done at the expense of increased power consumption. In the meantime, datacenters are becoming increasingly heterogeneous, integrating multiple OS-capable processors based on different Instruction Set Architectures (ISAs). While x86 has long been the main ISA in the datacenter [23], ARM [1, 3, 11, 17, 29] and PowerPC [22] are currently gaining traction. This leads to the notion of the fully heterogeneous datacenter [13, 21, 30]. Each integrated type of machine has its own performance and power consumption profile for certain classes of applications, and even for specific phases in a single application's execution lifetime.
This has been shown by the heterogeneous Chip Multi-Processor (CMP) community [35], and we demonstrate that it also holds in a computer cluster. Efficient distribution and migration of workloads among machines of different types is necessary to leverage the flexibility brought by this heterogeneity. Currently, heterogeneous machines with different ISAs are already present in datacenters; however, they operate separately. We foresee significant performance and power consumption gains in having these machines cooperate towards the completion of multiprogrammed workloads. We argue that all these issues can and should be tackled at the operating system level, without significant effort from the application programmer, in order to maximize the solution's acceptance.
In this position paper, we propose an OS architecture that enables efficient process and thread mapping/migration in a fully heterogeneous computer cluster, in order to maximize a workload's performance and/or to minimize its power consumption.
The OS architecture we propose is based on three main concepts. First, we advocate for a distributed, multi-kernel operating system running on a heterogeneous computer cluster and providing a single system image of the cluster to the programmer. Multi-kernel systems already communicate through message passing and well-defined interfaces, making them a good fit for a distributed implementation on heterogeneous machines. The single system image abstracts away the complexity of working with several heterogeneous nodes and increases programmability. In addition, it permits datacenter-oriented applications to seamlessly share resources such as the filesystem or IPC, a feature much needed in current environments [36]. Second, an efficient Distributed Shared Memory (DSM) system should be implemented at the operating system level, allowing on-demand address space transfer during task migration between heterogeneous machines, as well as the distribution of a task's threads across multiple heterogeneous machines, all while preserving the convenient shared memory programming model. Finally, such a system brings the need for a task scheduler that is aware of the heterogeneity of the managed cluster and capable of taking efficient decisions based on the affinity of the workload towards specific machines. In addition to OS-level components, compiler toolchain support is also needed to produce multi-ISA binaries with the ability to execute on, and migrate between, heterogeneous machines at runtime.
We propose to implement these concepts by augmenting the Popcorn Linux [4, 6] OS. Based on Linux, Popcorn is currently capable of migrating tasks back and forth between the arm64 and x86_64 architectures in a two-machine setup. We propose to enable Popcorn to run at the scale of a heterogeneous cluster.
In this paper we focus on native execution without the use of virtualization. While some virtualization technologies [7, 15] allow cross-ISA migration, their performance impact is high and unacceptable in many situations. Other virtualization technologies, such as containers, ease the management of computer clusters and are widely used in datacenters today. However, these technologies do not support cross-ISA migration. More generally, while it is decreasing, virtualization overhead is still a concern in large-scale datacenters [36].
We also focus on native shared memory applications. Exploring performance and power consumption benefits for distributed applications such as MPI jobs is an interesting research direction; however, we believe that shared memory applications present a strong programmability argument compared to distributed applications.
It should be noted that heterogeneity in terms of performance and power consumption profiles can also be found in a cluster that is homogeneous from the ISA standpoint. The differences between machine characteristics in such an environment might not be as strong as when considering multiple ISAs, and neither would be the benefits to reap from exploiting such a cluster. Even so, to obtain these benefits from such a set of machines, the components of the system proposed in this paper will still be needed.
This paper is organized as follows: in Section 2, we make the case for a fully heterogeneous computer cluster in which machines of different ISAs cooperate in processing a datacenter workload. In Section 3, we detail our system proposal, before concluding.
2 THE CASE FOR HETEROGENEOUS COLLABORATION
Static Job Mapping. For a given workload, different types of machines will exhibit different performance and power consumption behaviors due to the workload's characteristics, defined according to multiple metrics: potential parallelism inherent in the workload, need for fast sequential processing, needs in terms of memory latency or bandwidth, disk or network I/O latency, etc. We ran standard benchmarks on a set of heterogeneous machines equipped with the following processors: (1) a desktop-class Intel Xeon E5-1650 [18] with 6 cores clocked at 3.5 GHz; (2) a server-class Cavium ThunderX [12] with a 2 GHz clock frequency, both in single-node (48 cores) and dual-node (96 cores) versions; (3) an APM X-Gene [24] with 8 cores clocked at 2.4 GHz. These machines have different ISAs: x86_64 for the Xeon, arm64 for the Cavium and the X-Gene. They also exhibit strongly different profiles in terms of power consumption and various performance-related characteristics such as clock frequency, number of cores, cache size, etc. We used the popular NPB OpenMP (class B and C) and PARSEC (native inputs) suites. We also performed a simple I/O test using the dd command to read, on each machine, a 512 MB file mounted on an NFS filesystem served from a remote machine. Concerning the best performance, the Xeon is the best machine for 48% of the benchmarks because of its high clock frequency, followed by the dual-node Cavium (41%) because of its high core count. Concerning the energy consumed, the single-node Cavium comes first (56%), as it offers a good core count/power consumption trade-off, followed by the Xeon (41%). The X-Gene gives the best power consumption only for the dd benchmark, which is not CPU intensive.
[Figure 1: Phases in the PARSEC blackscholes benchmark: power consumption over time on the dual-node Cavium ThunderX and the Xeon E5-1650.]
From these results, we can draw several conclusions. First, considering each metric, there is no ideal machine that always yields the best results over the entire set of benchmarks. Second, for a given benchmark, the machine offering the highest performance is very often not the one yielding the lowest energy, showing that going faster does not necessarily translate into lower energy consumption. Moreover, this second observation shows that, according to the metric one wants to optimize (performance or power), tasks should be mapped to different machine types. These conclusions show that the gains brought by a fully heterogeneous cluster promise to be significant, assuming an efficient static job mapping.
Leveraging Time Affinity through Task Migration. Some of the benchmarks from the suites we considered exhibit multiple phases in their execution lifetime. We define a phase as a portion of a program's lifetime during which the program shows some affinity towards a specific type of machine for a particular metric (e.g., performance, power consumption, etc.). Some programs are composed of multiple phases, and in each phase the affinity for a given metric goes to a different machine. The concept of program phases has already been investigated in the chip multiprocessor community. Research in that field defines program phases according to low-level parameters extracted from hardware performance counters, such as instructions per cycle or CPU cache misses [31-33, 35]. Such phases are micro-phases, and because the migration overhead in a cluster is much higher than in a CMP, it will be difficult to exploit them. We propose instead to work on macro-phases, defined according to high-level machine characteristics: number of cores, clock frequency, NUMA memory organization, etc. They are significantly longer than micro-phases.
To illustrate the concept of macro-phases, we take the blackscholes benchmark from the PARSEC suite as an example. The power consumption observed over time for that benchmark on the Xeon and the dual-node Cavium is represented in Figure 1. The benchmark exhibits strongly sequential initialization and deinitialization phases on both machines. We confirmed this by monitoring CPU activity, which showed that only one core is active (utilized at 100%).
The second phase of the workload is highly parallel, using all the cores and leading to the maximum power consumption on both platforms. The sequential phase is considerably faster (3.6x) on the Xeon due to its high single-threaded processing power. The parallel phase is more than twice as fast on the Cavium because of its high number of hardware threads. Assuming a zero-cost migration, one can estimate that, for this benchmark, executing each phase on the right machine would yield a 32% execution time reduction compared to a Xeon-only execution, and a 63% execution time reduction / 76% energy consumption reduction compared to a Cavium-only execution.
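To make the reasoning behind such estimates explicit, the potential gain of phase-aware placement can be written as follows (our formulation; the symbols are not taken from the measurements themselves). Let $t_s^X$ and $t_p^X$ be the durations of the sequential and parallel phases on the Xeon, and $t_s^C$, $t_p^C$ their counterparts on the Cavium, with $t_s^C \approx 3.6\, t_s^X$ and $t_p^X > 2\, t_p^C$ according to the observations above. Under the zero-cost migration assumption, the best schedule runs the sequential phase on the Xeon and the parallel phase on the Cavium:

\[
T_{best} = t_s^X + t_p^C, \qquad
G_{Xeon} = \frac{(t_s^X + t_p^X) - T_{best}}{t_s^X + t_p^X} = \frac{t_p^X - t_p^C}{t_s^X + t_p^X}, \qquad
G_{Cavium} = \frac{t_s^C - t_s^X}{t_s^C + t_p^C}.
\]

Plugging the measured phase durations into $G_{Xeon}$ and $G_{Cavium}$ yields the 32% and 63% figures quoted above; a real scheduler would additionally have to subtract the migration cost from these gains.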
Preliminary results indicate that such sequential and parallel phases can be found in other benchmarks. The analysis presented here relates the potential parallelism of a phase to the number of cores in a machine. One can imagine other types of program/phase-to-machine affinity, such as the memory access pattern of a program relative to the memory hierarchy organization of a machine (in particular, the NUMA characteristics of some machines), or low idle power consumption for I/O-intensive workloads.
Space Distribution: Splitting Threads on ISAs. A Distributed Shared Memory (DSM) system is needed to provide efficient on-demand address space transfer in the case of a task migration between heterogeneous machines. Such a DSM implementation will provide the additional benefit of enabling the distribution of threads from the same application over multiple machines of different ISAs. In the next paragraphs, we discuss the potential benefits to be gained in doing so. As cross-ISA migration on phase boundaries can be described as leveraging affinity in time, distributing the threads of an application over different ISAs leverages affinity in space.
This raises the following question: would certain applications benefit from having some of their threads execute on one ISA, and other threads on another ISA? Some applications spawn threads that perform similar work (for example, under the thread pool model), and it is very probable that there is no space affinity in that case: because all the threads have the same behavior, the affinity would probably go to a single machine. However, space affinity might arise when the work performed by the threads differs significantly.
An interesting example is a Database Management System (DBMS). A DBMS divides query processing into several tasks, such as query parsing, index lookup, index building, logging, storage management, and so forth. These tasks are handled by dedicated threads that communicate with each other via message queues.
This DBMS design maximizes query processing capacity by pipelining tasks, improves scalability, and provides flexibility. Due to the diversity of these tasks, the corresponding threads should exhibit different characteristics, showing affinity towards different ISAs.
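As a minimal sketch of this pipelined design (our illustration, assuming POSIX threads; this is not code from an actual DBMS), the following program connects a sequential "parser" stage to a throughput-oriented "lookup" stage through a bounded message queue. The two stages have clearly different characteristics, and a heterogeneity-aware scheduler could therefore place them on different ISAs:

/* Hypothetical sketch of queue-connected pipeline stages. */
#include <pthread.h>
#include <stdio.h>

#define QCAP 64

struct queue {                       /* bounded message queue */
    int buf[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static void q_push(struct queue *q, int v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int q_pop(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}

static struct queue parsed = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
};

/* "Parser" stage: sequential, latency-sensitive; affinity for a
 * machine with fast single-thread performance (e.g., the Xeon). */
static void *parser(void *arg)
{
    for (int i = 0; i < 1000; i++)
        q_push(&parsed, i);          /* pretend each i is a parsed query */
    q_push(&parsed, -1);             /* end-of-stream marker */
    return NULL;
}

/* "Lookup" stage: throughput-oriented; affinity for a machine
 * with many cores (e.g., the Cavium ThunderX). */
static void *lookup(void *arg)
{
    long sum = 0;
    int v;
    while ((v = q_pop(&parsed)) != -1)
        sum += v;                    /* stand-in for index lookup work */
    printf("lookup stage processed sum=%ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, parser, NULL);
    pthread_create(&t2, NULL, lookup, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Because the stages only interact through the queue, migrating one stage to another machine only requires the DSM to keep the queue pages consistent, rather than the whole address space.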
Another potential benefit brought by the ability to distribute threads among ISAs is fine-grained power capping and resource management in a heterogeneous cluster: a single thread of an application can be migrated to free some cores on a powerful machine for an upcoming job with a strong affinity towards that machine, or to accommodate a power cap.
3 SYSTEM MODEL
Hardware Organization. To provide heterogeneity in a datacenter, an intuitive approach is to keep homogeneity (in terms of ISA) within a rack but to allow heterogeneity between racks. Each rack comprises a number of machines having the same ISA, computational power, and configuration, whereas different racks may run different ISAs and exhibit different performance/power consumption profiles.
In other words, heterogeneity is provided at a per-rack granularity. This approach is simple and easy to apply to datacenter environments. However, we argue that this per-rack granularity cannot fully exploit the benefits of fine-grained ISA affinity or thread migration. Indeed, in order to take advantage of fine-grained ISA affinity through thread migration, the migration overhead between heterogeneous nodes must be low enough not to offset the benefit. In the per-rack heterogeneous configuration, however, the distance between two heterogeneous nodes is significant: inter-rack interconnects have narrow bandwidth and long latencies, making crossing the rack boundary costly. Thus, it might be difficult or infeasible to capitalize on the benefits of the per-rack heterogeneous configuration.
In this sense, we claim that heterogeneity should be provided within a rack. We define our heterogeneous rack model as follows.
The rack contains a given number of bundles. A bundle is defined as a group of heterogeneous nodes connected to each other via a high-speed, low-latency, point-to-point interconnect technology. In the concrete example of the heterogeneous rack we have built, a bundle consists of one Xeon E5-2620 v4 server connected to a dual-node Cavium ThunderX server [12] through a point-to-point PCIe interconnect using Dolphin PXH810 adapters [14]. The PCIe interconnect provides 56 Gb/s of bandwidth and a stable low latency, free of switching latency, which is essential to lower the thread migration overhead. In addition, a Xeon Phi 7120P [19] is installed in a PCIe x16 slot of the Xeon server as a co-processor. This configuration provides us with diverse heterogeneity along various axes: different ISAs (x86_64, k1om, and arm64), core counts (16, 61, and 96), single-core performance, energy efficiency, interconnect speed, etc.
This bundle is the building block of the heterogeneous rack. We believe this tightly coupled model is indispensable to keep the networking latency low between heterogeneous machines during the performance-critical thread migration and memory consistency protocols. Currently, our rack is configured with eight bundles (8 Xeon servers, 8 Cavium servers, and 8 Xeon Phis). All servers in the rack communicate with each other over InfiniBand through a Mellanox SX6036 switch. We envision this rack as a building block for the heterogeneous datacenter, with additional switching layers deployed between multiple racks. Figure 2 illustrates the organization of our proposed system. While we currently consider only two servers and one co-processor per bundle, we plan to investigate different configurations comprising additional nodes of various ISAs such as POWER8, SPARC, and Intel Knights Landing.
Systems Software. As discussed in the literature [10, 34], converting a regular shared memory program written for a single-node system to a distributed model, using for example the Message Passing Interface (MPI), requires a huge amount of effort, impairing programmability. Furthermore, the diverse heterogeneity between nodes adds complexity on top of this reduced programmability. This will prevent users and systems from fully utilizing the advantages of the heterogeneous system, and thus will reduce the acceptance of the proposed solution. In this sense, we argue that the OS is the proper level at which to integrate the systems software managing the complexity of having heterogeneous machines within a rack, and to abstract that complexity from the application programmer and the system administrator. As the operating system for a heterogeneous rack-scale system, we will extend our previous work, Popcorn Linux [5, 9]. Popcorn Linux is an open-source project based on the Linux kernel. Originally, it aimed to run replicated Linux kernels on each core of a multi-core machine. Building on the feature that provides a single system image over multiple kernels, the project was extended to span multiple machines having different ISAs. As a multi-kernel, Popcorn Linux is particularly well suited for a distributed implementation at a larger (rack) scale, as kernels already communicate by message passing through well-defined interfaces. Currently, Popcorn Linux supports Xeon-Xeon Phi and Xeon-ARM X-Gene configurations, i.e., pairs of machines. In particular, Popcorn Linux is able to migrate tasks between the x86_64 and arm64 architectures with a sub-second latency [4]. Note that this number does not include the address space transfer, which happens on demand after migration.
Popcorn Linux lacks features such as migrating a thread to a machine running the same ISA, spanning a process over more than two machines, and so forth.
These features are not only critical for multi-node setups but also required to capitalize on the heterogeneity of the rack. Thus, we are extending the current Popcorn Linux to support these features on various heterogeneous configurations.
In order to provide flexible thread migration within a rack comprised of many heterogeneous nodes, we are primarily extending the memory consistency protocol that Popcorn Linux currently implements. Popcorn Linux currently only handles the two-host case: a page that does not exist on the current node is necessarily on the other node. In this case, the page is brought to the node via the interconnect to resume execution. The system calls that manipulate the virtual memory area layout (e.g., mmap, munmap, etc.) are replayed at the home location of the current thread. This minimalist support for distributed execution has been the major bottleneck, even in the two-node setup.
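The following user-space sketch illustrates the two-host logic described above (our simplified illustration; the names and data structures are hypothetical and do not reflect Popcorn Linux's actual kernel implementation):

/* Simplified model of two-node, on-demand page transfer. */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    16

struct page {
    bool present;                    /* page resident on this node? */
    char data[PAGE_SIZE];
};

static struct page local_mem[NPAGES];   /* this node's view         */
static struct page remote_mem[NPAGES];  /* stand-in for the peer    */

/* Stand-in for a message over the point-to-point interconnect:
 * request the page from the remote node and invalidate it there,
 * so that exactly one node holds the up-to-date copy. */
static void fetch_page_from_remote(int pfn)
{
    memcpy(local_mem[pfn].data, remote_mem[pfn].data, PAGE_SIZE);
    remote_mem[pfn].present = false;    /* ownership moves here     */
    local_mem[pfn].present = true;
}

/* Fault handler: in the two-host protocol, "not here" implies
 * "on the other node", so a miss always becomes a remote fetch. */
static char *page_fault(int pfn)
{
    if (!local_mem[pfn].present)
        fetch_page_from_remote(pfn);
    return local_mem[pfn].data;
}

int main(void)
{
    /* Pretend page 3 was last written on the remote node. */
    remote_mem[3].present = true;
    strcpy(remote_mem[3].data, "migrated thread's data");

    /* A migrated thread touching page 3 triggers the fetch. */
    printf("%s\n", page_fault(3));
    return 0;
}

With more than two nodes, the "not here implies on the other node" shortcut no longer holds: the protocol must track which node owns each page, which is precisely what the per-process DSM protocol described next addresses.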
To tackle this limitation, we are working on implementing a per-process distributed shared memory (DSM) protocol in kernel space. DSM is a concept that has been discussed since the 1990s [2, 8, 20, 28]. Although the promise of DSM is that users can run regular shared memory applications, without modification, on several nodes in a distributed way, DSM did not gain much popularity outside of the research community at the time. One of the main reasons was the low performance of complex DSM protocols over slow interconnect technologies. Indeed, DSM protocols keep pages consistent by exchanging data over the network; thus, the bandwidth and latency of the interconnect are critical to system performance. In the 1990s and 2000s, commodity interconnect technologies were not fast enough to guarantee the desired performance, and the devices providing fast interconnects were so expensive that they could not be applied to a rack-scale computing environment. Nowadays, however, interconnect speeds have increased considerably: for example, PCI Express 4.0 is designed to provide 16 GT/s per lane [27]. Such high interface performance helps close the bandwidth gap between the memory bus and peripheral interconnects. Moreover, recent research revisiting DSM on modern interconnects shows promising results [16, 25, 26].
Another important point about DSM concerns coping with node failures. While this is not the primary focus of the research presented here, we plan to address this issue through regular checkpointing of the address spaces of DSM-enabled processes. We foresee that solutions based on replication would introduce overheads that could be very high. On the contrary, given the type of workloads we focus on (long-running, compute-intensive jobs), the overhead of regular checkpointing should be acceptable.
Based on these premises, we are reviving the case for DSM in rack-scale computing. We expect the tightly coupled configuration of the bundle to help minimize the data communication overhead of the DSM protocol.
To fully utilize the cores abstracted into a single system image, the system requires a heterogeneous scheduler that assigns processes and threads to heterogeneous cores scattered across the rack, according to multiple metrics such as the affinity of a thread (or of the macro-phase the thread currently executes) towards a certain ISA, and the current system load. We anticipate that the strong heterogeneity will make the design of this job scheduler very challenging. The following questions need to be answered in designing such a scheduler.
First, where will the scheduler run? It could run on a single machine, or in a distributed fashion among the nodes. In either case, one needs to consider the communication cost of exchanging scheduling information. We plan to design the scheduler in a distributed way. The second question is: where should an incoming job first be dispatched? We are working on predicting the ISA affinity of a job based on past behavior, or on information that can be extracted from the binary; we also consider embedding affinity information in the binary at compile time. Another question is: how to assess the migration cost? The cost of migrating a thread will not be static, given that the DSM protocol complicates thread migration. The scheduler should thus estimate the benefit or overhead of migrating a thread.
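One simple way to frame this estimate (our formulation, not a mechanism already present in Popcorn Linux) is to migrate a thread only when the predicted remaining execution time on the target machine, plus the migration cost, beats the predicted remaining time in place:

\[
t_{remain}^{target} + C_{mig}(W) < t_{remain}^{source},
\]

where the migration cost $C_{mig}(W)$ grows with the size $W$ of the working set that the DSM protocol will have to transfer on demand after the migration, and therefore varies over the thread's lifetime.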
A final question is: how to assess at runtime the values of the parameters on which scheduling decisions will be based? Following our example with the PARSEC blackscholes benchmark, which exhibits several phases in terms of need for parallelism or fast sequential processing power, an intuitive approach is the following. Detecting that an application currently executing on a high number of slow cores (for example, on the Cavium) should be migrated to a machine with fewer but faster cores (for example, the Xeon) can easily be done by checking CPU activity: a good indicator of that scenario is an application using 100% of a single core, as is the case in the first phase of blackscholes. To detect the inverse scenario, i.e., an application currently executing on a small number of cores that needs more parallel execution units, we also plan to monitor CPU activity. In addition, assuming cores are dedicated to applications, we plan to investigate monitoring the number of context switches of the application's threads on these CPUs: a high number of context switches combined with high CPU usage might indicate that the application currently has more active threads than assigned cores, and that it would benefit from a migration to a high core count machine such as the Cavium. The design of the job scheduler remains an open question for now, and we will investigate these cases to find reasonable answers.
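The following sketch shows how such a detector could be structured (our illustration; the thresholds are hypothetical placeholders, and the actual scheduler design remains open):

/* Hypothetical phase-detection heuristic based on per-core
 * utilization and context-switch counts, as described above. */
#include <stdio.h>

enum hint { STAY, MIGRATE_TO_FAST_CORES, MIGRATE_TO_MANY_CORES };

struct app_sample {
    int    ncores;          /* cores currently assigned to the app */
    double util[96];        /* per-core utilization in [0,1]       */
    long   ctx_switches;    /* context switches over the interval  */
};

static enum hint classify(const struct app_sample *s)
{
    int busy = 0;
    for (int i = 0; i < s->ncores; i++)
        if (s->util[i] > 0.9)
            busy++;

    /* One saturated core among many: sequential phase; prefer a
     * machine with fast single-thread performance (e.g., Xeon). */
    if (busy == 1 && s->ncores > 1)
        return MIGRATE_TO_FAST_CORES;

    /* All cores saturated plus heavy context switching: more
     * runnable threads than cores; prefer a high core count
     * machine (e.g., Cavium). The threshold is arbitrary. */
    if (busy == s->ncores && s->ctx_switches > 1000)
        return MIGRATE_TO_MANY_CORES;

    return STAY;
}

int main(void)
{
    /* Synthetic sample mimicking blackscholes' sequential phase
     * running on a many-core machine. */
    struct app_sample seq = { .ncores = 4,
                              .util = { 1.0, 0.01, 0.02, 0.0 },
                              .ctx_switches = 12 };
    printf("hint=%d\n", classify(&seq)); /* MIGRATE_TO_FAST_CORES */
    return 0;
}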
4 CONCLUSION
In order to reap the potential benefits in terms of performance and power consumption of the fully heterogeneous datacenter, we advocate for OS support in the form of a multi-kernel providing a single system image, DSM support, and heterogeneous scheduling.
This work is supported in part by ONR under grant N00014-16-1-2711 and by AFOSR under grant FA9550-16-1-0371.
