INTRODUCTION
Traditionally, software and hardware providers have been delivering significant performance improvements on a yearly basis. Unfortunately, this is no longer feasible. Predictions about "dark silicon" [25] and resiliency [53], especially in the forthcoming exascale era [17], suggest that traditional approaches to computing problems are impeded by power constraints and process manufacturing. Furthermore, since single-threaded performance has saturated at both the hardware and the software layers, new ways of pushing the boundaries have emerged. After the introduction of multi- and many-core systems, heterogeneous computing and ad-hoc acceleration, via ASICs and FPGAs [34, 65], are advancing into mainstream computing.
The extreme scaling of current architectures, from low-power wearables to high-performance computing, along with the diversity of programming languages and software stacks, creates a wide design space to explore in the search for optimal energy-efficient results. Co-designing an architectural solution at the system level requires tight integration and collaboration between teams that have typically been working in isolation. The design space to be explored is vast, and there is the potential that a poor, even if well-intentioned, decision will propagate through the entire co-designed stack. Thus, amending the consequences at a later date may prove extremely complex and expensive, if not impossible.
In this paper we present Beehive: a complete full-system hardware/software co-designed platform for rapid prototyping and experimentation (all the available hardware and software components of Beehive will be publicly available). Beehive enables co-designed optimizations from the application level down to the system and hardware level, enabling accurate decision making for architectural and runtime optimizations. As a use case, we accelerate and optimize the complex KinectFusion [47] Computer Vision application in numerous ways through Beehive's highly integrated stack, achieving up to 43x performance improvements.
In detail, Beehive makes the following contributions:
• Enables co-designed research and development for traditional and emerging applications and workloads: To achieve this, we tightly integrate the software and hardware layers of the stack in a unified manner while expanding Beehive's reach to complex applications and workloads (Section 2.2). We showcase that capability by implementing a Java-based version of KinectFusion and co-designing it through Beehive's stack.
• Enables co-designed compiler and runtime research for multiple dynamic and non-dynamic programming languages in a unified manner: This is achieved by unifying, under the same compilers and runtimes, high-quality production and research Virtual Machines that are able to execute multiple programming languages transparently (Section 2.3.1).
• Enables heterogeneous processing on a variety of platforms such as ARM (ARMv7 and AArch64) and x86: The unified runtime layer has been extended to support multiple ISAs, scaling from high-performing x86 to low-power ARM architectures (Section 2.3). We showcase that capability by evaluating standard benchmarks along with the KinectFusion use case.
• Provides fast prototyping and experimentation on heterogeneous programming on GPGPUs, SIMD units, and FPGAs: The novel Tornado, Indigo, and MAST modules achieve transparent heterogeneous execution on GPGPUs, SIMD units, and FPGAs respectively, without sacrificing productivity (Sections 2.3.3, 2.3.2, 2.5). We showcase that capability by accelerating KinectFusion on GPGPUs, SIMD units, and FPGAs under the same infrastructure.
• Enables co-designed architectural research on power, performance, and resiliency techniques via high-performing simulators and real hardware: Along with a plethora of real hardware, Beehive integrates a number of high-performing simulators in a unified framework (Section 2.6). We showcase this capability by providing a novel hardware/software co-designed optimization for KinectFusion.
• Supports dynamic binary optimization techniques via instrumentation and optimization at the system and chip level: Beehive extends its research capabilities to novel micro-architectures by providing dynamic binary instrumentation and optimization techniques for all supported hardware architectures (Section 2.4).
The paper is organized as follows: Section 2 explains the architecture of Beehive along with its individual components. Section 3 presents the Computer Vision application that forms the use case in this paper. Section 4 presents the various co-designed optimizations applied to the selected application along with their corresponding performance evaluations. Finally, Sections 5 and 6 present the related work, and the concluding remarks and future vision of Beehive, respectively.
BEEHIVE ARCHITECTURE
2.1 Overview
Beehive, as depicted in Figure 1, follows a multi-layered approach of highly co-designed components spanning from the application down to the hardware level.
The design philosophy of Beehive revolves around five pillars:
(1) Rapid prototyping for developing full-stack optimizations efficiently by using high-level programming languages.
(2) Diversity for tackling multiple application domains, programming languages, and runtime systems in a unified manner.
research amongst them. Beehive supports both managed and unmanaged languages as explained in Subsection 2.3. Finally, applications can execute either directly on hardware, indirectly on hardware using a dynamic binary optimization layer, or inside Beehive's simulator stack. The following subsections explain in detail each layer of Beehive along with the supported applications, programming languages, and hardware platforms.
Applications
Beehive targets a variety of applications in order to enable co-designed optimizations in numerous domains. Whilst compiler and micro-architectural research traditionally uses benchmarks such as SpecCPU [56], SpecJVM [57], Dacapo [14], and PARSEC [10], Beehive also considers complex emerging application areas.
The two primary domains targeted by Beehive are Computer Vision applications and algorithms, such as KinectFusion [47] and other SLAM (Simultaneous Localization and Mapping) algorithms, along with Big Data software stacks such as Spark [7], Flink [5], and Hadoop [6]. To showcase Beehive, we selected an implementation of KinectFusion to be the main vehicle of experimentation.
Recent advances in real-time 3D scene understanding capabilities can radically change the way robots interact with and manipulate the world. A proliferation of applications and algorithms in recent years has targeted real-time 3D space reconstruction in both desktop and mobile environments [1, 2, 47]. To assess both the accuracy and performance of the proposed optimizations, we use SLAMBench [46], a benchmarking suite that provides a KinectFusion implementation. SLAMBench harnesses the ICL-NUIM dataset [30] of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementations and algorithms. SLAMBench currently includes implementations in C++, CUDA, OpenCL, and OpenMP, allowing a broad range of languages, platforms, and techniques to be investigated. In Section 3, SLAMBench is explained and decomposed into its key kernels.
Runtime Layer
Some of the key features of Beehive are found in its runtime layer, which provides capability beyond simply running native applications. One of the challenges when designing such tightly co-designed systems is application and programming language support. Supporting numerous runtimes with various back-ends and compilers, while seamlessly integrating them with the lower layers of the computing stack, is a time-consuming task which impedes the maintainability of the whole platform. These issues will in turn manifest as slow adoption of state-of-the-art software and hardware components and applications.
In order to overcome these challenges, we have taken the design decision to build the runtime layer around two components: the Java Virtual Machine (JVM) and native C/C++ applications. Despite being able to execute native C/C++ applications (regardless of the compiler used), Beehive has been designed to target languages that can run, and be optimized, on top of the JVM. The advent of the Graal compiler [24] along with the Truffle AST interpreter [64] enables the execution of multiple existing, and novel, dynamic and non-dynamic programming languages and DSLs on top of the JVM. By building the Beehive platform around Truffle, Graal, and the JVM, we achieve high-performing execution of a variety of programming languages in a unified manner. Furthermore, the amount of maintenance required is contained to two compilers and one runtime system. In addition, any changes from the open-sourced Graal and Truffle projects can be down-streamed to Beehive, keeping it synchronized with the latest software components.
Regarding the runtime systems of Graal and Truffle, two design alternatives have been deployed. The first route is the vanilla implementations running on top of OpenJDK. The benefit of this approach is that Beehive can be utilized by industrial-strength, high-performing systems that run on top of OpenJDK. This, however, has a number of drawbacks. Components of the runtime layer such as Object Layouts, Garbage Collection (GC) algorithms, Monitor Schemes, etc., are difficult to research due to the lack of modularity in OpenJDK. To that end, we decided to add an additional runtime layer for Graal and Truffle: the Maxine Research Virtual Machine [63].
The MaxineVM, a meta-circular Java-in-Java VM developed by Oracle Labs, has been adopted and augmented for usage in Beehive [38]. Since its last release from Oracle, it has been enhanced by the Beehive team in terms of both performance and functionality (Section 2.3.1).
The Graal compiler ported on top of MaxineVM has been stabilized and its performance has been improved, making MaxineVM the highest performing research VM (Section 2.3.1). In addition, as depicted in Figure 1, both MaxineVM and OpenJDK use the same optimizing compiler accompanied by the Truffle AST interpreter, enabling Beehive to extend its research capabilities from industrial-strength to high-quality research projects.
The multi-language capabilities of Beehive have been further augmented by novel software components that enable heterogeneous execution of applications on numerous hardware devices: Indigo, Tornado, and MAST [38, 45].

Figures 2 and 3 illustrate the performance of MaxineVM on x86 and ARMv7 on Dacapo9.12-bach [14] and SpecJVM2008 [57] respectively. As illustrated in Figure 2 (footnote 3), since Oracle's last release (MaxineGraal-rev.20290 Original), performance has been increased by 64% (Maxine-Graal-rev.20381 Current), while Maxine currently achieves half the performance of industrial-strength OpenJDK with the C2 and Graal (rev. 21075) compilers. The target is to get the JIT performance of both VMs on par by enabling more aggressive Graal optimizations in Maxine, such as escape analysis [59] and other compiler intrinsics. Unfortunately, we could not compare against JikesRVM [3] since it cannot run the Dacapo9.12-bach benchmarks on x86_64.
Regarding ARMv7, as depicted in Figure 3 (footnote 4), the performance of MaxineVM-ARMv7 falls between the performance of OpenJDK-Zero and OpenJDK-1.7.0-(Client, Server). MaxineVM outperforms OpenJDK-Zero by 12x on average across SpecJVM2008 (footnote 5), while it is around 0.5x and 0.3x slower than the OpenJDK-1.7.0 client and server compilers respectively. As in x86, many optimizations, both in the compiler and the code generator, will be implemented and/or enabled in order to match the performance of the industrial-strength OpenJDK.

Footnote 3: Intel(R) Core(TM) i7-4770@3.4GHz, 16GB RAM, Ubuntu 3.13.0-48-generic, 16 iterations, 12GB heap.
Footnote 4: Samsung Chromebook, Exynos 5 Dual@1.7GHz, 2GB RAM, Ubuntu 3.8.11, 2GB heap.
Footnote 5: Serial was excluded from the evaluation.

Figure 4: Indigo's interaction with the Graal compiler.
Regarding the memory manager (GC), various options are being explored, ranging from enhancing MaxineVM's current GC algorithms to porting existing state-of-the-art memory management components. Currently MaxineVM supports semi-space and generational schemes.
2.3.2 Indigo. Indigo, a novel component of Beehive, is an extension plugin for Graal that provides efficient execution of short vector types, commonly found in Computer Vision applications, and support for SIMD execution. While Indigo was initially designed to enhance the performance of Computer Vision applications, it can be easily expanded to provide generic vectorization support in Graal, a feature which is currently missing from its public distribution. Figure 4 outlines how Indigo operates with the Graal compiler.
As depicted in Figure 4, Indigo uses Graal's invocation plugin, which enables the custom addition of a node in Graal's Intermediate Representation (IR). This, in turn, can be exploited by Indigo to redirect the compilation route from Graal to Indigo and use its compilation stack to compile and optimize for SIMD execution. Within Graal, the IR is maintained as a structured graph with nodes representing actions or values while edges represent their dependencies. The graph is initially generated by parsing the bytecode from a class file.
The objective of vectorization is to reduce the distance between vector operations in the IR, enabling further optimizations through virtualization (i.e. escape analysis and scalar replacement [59]). With the use of virtualization, we can maintain temporary vectors entirely in the registers of the targeted architectures. The addresses of the vectors are used for reading and writing, enabling us to break free from the primitive Java types and, more importantly, from the use of Java arrays. However, since this is not an inherently safe use of the Java semantics, we made the following assumptions:
• Hardware supports 128-bit vector operations, which holds for ARM NEON and Intel SSE implementations.
• The class contains four single-precision floating point numbers suitable for vector operations of SLAM applications.
• Unused elements of a vector are zero.
• The elements of a vector are contiguous in memory.
• Once constructed, a vector is immutable.
The aforementioned assumptions apply to the library provided by Indigo and in turn allow some of the restrictions in Java to be eliminated. This enables the IR to be extended and optimized more aggressively, since the semantics are now within the vector abstraction and not within the general purpose language.
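The assumptions above can be illustrated with a minimal sketch of a short vector type. The class name `Float4` and its methods are hypothetical and are not taken from Indigo's library; the sketch only shows what "four single-precision elements, unused elements zero, immutable once constructed" looks like at the Java source level.

```java
// Hypothetical illustration only: Float4 is not Indigo's actual class or API.
final class Float4 {
    // Four single-precision lanes: 4 x 32 bits matches one 128-bit NEON/SSE register.
    private final float x, y, z, w;

    Float4(float x, float y, float z, float w) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.w = w;
    }

    // A three-element vector keeps its unused fourth lane at zero.
    static Float4 of3(float x, float y, float z) {
        return new Float4(x, y, z, 0f);
    }

    // Immutable: operations return a fresh value instead of mutating this one,
    // a property that escape analysis and scalar replacement can exploit.
    Float4 add(Float4 o) {
        return new Float4(x + o.x, y + o.y, z + o.z, w + o.w);
    }

    float dot(Float4 o) {
        return x * o.x + y * o.y + z * o.z + w * o.w;
    }

    float lane(int i) {
        switch (i) {
            case 0: return x;
            case 1: return y;
            case 2: return z;
            case 3: return w;
            default: throw new IndexOutOfBoundsException("lane " + i);
        }
    }

    public static void main(String[] args) {
        Float4 a = Float4.of3(1f, 2f, 3f);
        Float4 b = Float4.of3(4f, 5f, 6f);
        System.out.println(a.dot(b));          // 1*4 + 2*5 + 3*6
        System.out.println(a.add(b).lane(3));  // the unused lane stays zero
    }
}
```

Because the unused lane is zero, a three-element dot product can be computed with the same four-lane operation as a four-element one.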
Invocation plugins allow the replacement of a method invocation with a sub-graph created during the graph building phase in Graal. We used a single node plugin that contains its own domain-specific compiler stack. The major benefit of this approach is the runtime independence from Graal. Therefore, Indigo can be downloaded and used as a standalone library: if the JVM uses Graal on top of the JVM Compiler Interface (JVMCI) [36], SIMD instructions can be emitted. Indigo's compiler stack contains a basic graph builder, optimizer, register allocator, and code generator with a scope limited to its target domain: Computer Vision applications.
Indigo nodes are generated either during the graph building phase of the compilation or indirectly during inlining. Once a graph has been constructed, it is transformed during the optimization phases by exploiting canonicalization and simplification to merge nodes. This allows us to maximize the number of operations in the node and eliminate new instance nodes (allocation of new objects) from the graph, leaving the data in registers. A simplification phase traverses the operand edges of the Indigo node to detect other Indigo nodes and merges the internal operation graphs together. When Indigo nodes are lowered to the low-level IR (LIR) nodes used by Graal, they must claim virtual registers from Graal. At this point we lower the operation to a generic SIMD instruction to be scheduled while profiling the register requirements. In order to maintain the vanilla implementation of Graal, we indirectly use its register allocator to provide general purpose and vector registers by claiming values to satisfy the requirements of the compiled method. Later, these will be converted into physical registers during the back end phases. The use of profiling enables us to offload the allocation algorithms to Graal, while ensuring that no vector registers are spilled to the stack. This technique prevents the JVM from entering unrecoverable states while being spatially more efficient.

Figure 5: Indigo's performance against Apache CML on common vector and matrix operations.

Thanks to the modularity of Graal, and access to the compiler through the JVMCI, it is possible to insert novel nodes into the compiler at runtime. With Indigo we show that it is possible to add a domain-specific compilation plugin to augment the Graal compiler.
This allows us to bypass all Graal internals and emit machine code exploiting SIMD instructions that are unsupported in the publicly available Graal. While this approach targets idiomatic SIMD for Computer Vision, there is no technical reason why it cannot be extended to insert other domain-specific knowledge into Java.

Figure 5 shows Indigo's relative performance against the Apache Common Mathematics Library (CML) [62] for a total of 13 vector and matrix operations commonly found in Computer Vision applications. As depicted in Figure 5, Indigo outperforms Apache CML in both vector and matrix operations. As expected, the largest gains are observed in matrix operations, with matrix-vector multiplication exhibiting a 66.75x speedup. The observed performance improvements derive from the use of SIMD execution along with the compiler optimizations provided by Indigo (null check elimination, scalar replacement, etc.).
Tornado.
Tornado, a novel component of Beehive that originated from JACC [20], is a framework designed to improve the productivity of developers targeting heterogeneous hardware. By exploiting the available heterogeneous resources, developers have the potential to improve the performance and energy efficiency of their applications. The key difference between Tornado and existing programming languages and frameworks is its dynamism; developers do not need to make a priori decisions about their hardware targets. The Tornado runtime system achieves transparent computation offloading with support for automatic device management, data movement, and code generation. This is possible by exploiting the design of VM-based languages: Tornado simply augments the underlying VM with support for OpenCL by using the JVMCI (Java Virtual Machine Compiler Interface), similarly to Indigo. The JVMCI allows efficient access to low-level information inside the JVM, such as a method's bytecodes and profiling information. Using this information, Tornado is able to JIT compile Java bytecode to execute on OpenCL-compatible devices.
As depicted in Figure 6, the Tornado API provides developers with a task-based programming model. In Tornado, a task can be thought of as being analogous to a single OpenCL kernel execution. This means that a task must encapsulate the code it needs to execute, the data it should operate on, and some meta-data. The meta-data can contain information such as the device it should execute on or profiling information. The mapping between tasks and devices is done at a task-level granularity, meaning each task is capable of being executed on a different piece of hardware. These mappings can be provided either by the developer or by the Tornado runtime; the mappings are dynamic and can change at any time.
Instead of focusing on scheduling individual tasks, Tornado allows developers to combine multiple tasks together to form a larger schedulable unit of work (called a task-graph). This approach has a number of benefits: firstly, it provides a clean separation between the code which coordinates task execution and the code which performs the actual computation; and secondly, it allows the Tornado runtime system to exploit a wider range of runtime optimizations. For instance, the task-graph provides the runtime system with enough information to determine the data dependencies between tasks. By using this knowledge, the runtime system is able to exploit any available task parallelism by overlapping task execution and data movement. It also provides the runtime system with the ability to eliminate any unnecessary data transfers that would occur because of read-after-write data dependencies between tasks.
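As a rough illustration of how a task-graph exposes read-after-write dependencies, the following sketch plans host-to-device copies for a list of tasks. It is not the Tornado API: the `Task` class, the `planCopyIns` method, and the buffer names are hypothetical, and the sketch assumes all tasks run in order on a single device.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these names are illustrative and are not the Tornado API.
final class TaskGraphSketch {
    // A task names the buffers it reads and the buffers it writes.
    static final class Task {
        final String name;
        final List<String> reads;
        final List<String> writes;

        Task(String name, List<String> reads, List<String> writes) {
            this.name = name;
            this.reads = reads;
            this.writes = writes;
        }
    }

    // Returns the host-to-device copies required when the tasks run in order
    // on one device. A buffer written by an earlier task is already resident,
    // so a later read of it (read-after-write) needs no transfer.
    static List<String> planCopyIns(List<Task> graph) {
        Set<String> onDevice = new HashSet<>();
        List<String> copies = new ArrayList<>();
        for (Task t : graph) {
            for (String buffer : t.reads) {
                if (onDevice.add(buffer)) {
                    copies.add(buffer + " -> " + t.name);
                }
            }
            onDevice.addAll(t.writes);
        }
        return copies;
    }

    public static void main(String[] args) {
        List<Task> graph = Arrays.asList(
            new Task("preprocess", Arrays.asList("depth"), Arrays.asList("filtered")),
            new Task("track", Arrays.asList("filtered", "reference"), Arrays.asList("pose")));
        // "filtered" is produced on the device, so it is never copied in.
        System.out.println(planCopyIns(graph));
    }
}
```

Scheduling the two tasks individually would have no way of knowing that `filtered` can stay on the device; the graph view is what makes the transfer removable.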
To increase developer productivity, Tornado is designed to make offloading computation as transparent as possible. This is achieved via its runtime system, which is able to automatically schedule data transfers between devices and handle the asynchronous execution of tasks. Moreover, the JIT compiler provides support for user-guided parallelization. The result is that developers are able to rapidly develop portable heterogeneous applications which can exploit any OpenCL-compatible device in the system.
Binary Instrumentation Layer
Beehive integrates a number of binary instrumentation tools to enable research and rapid prototyping of novel microarchitectures and ISA extensions. Along with the well-established Intel PIN tool [43], Beehive integrates the newly introduced MAMBO [27] and MAMBO-X64 [22] tools for the ARMv7 and AArch64 architectures.
MAMBO.
MAMBO is a low-overhead dynamic binary instrumentation and modification tool for the ARM architecture which currently supports ARMv7 and the AArch32 execution state of ARMv8. In the context of Beehive, the performance of MAMBO has been further improved since its first release. The introduced optimizations include:
• A novel scheme to enable hardware return address prediction for dynamic binary translation.
• A novel software indirect branch prediction scheme for polymorphic indirect branches.
• A number of micro-architecture-specific optimizations such as usage of huge pages for internal data.
While the initial version of MAMBO achieves a geometric mean overhead of 28% on a Cortex-A9 (a dual-issue out-of-order superscalar processor with 8 to 11 pipeline stages) and of 34% on a Cortex-A15 (a triple-issue out-of-order superscalar processor with 15 to 24 pipeline stages), the introduced optimizations reduce the overhead on the two systems to 15% and 21% respectively.
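For readers unfamiliar with the metric, one common way to compute a geometric mean overhead from per-benchmark timings is sketched below. This is an assumption about the calculation, not a description of the authors' evaluation scripts; the class and method names are hypothetical and the timings in `main` are made-up placeholders, not measurements.

```java
// Hypothetical helper: shows one common definition of geometric mean overhead.
final class GeoMeanOverhead {
    // Overhead = geometric mean of (instrumented time / native time), minus 1.
    static double overhead(double[] nativeSeconds, double[] instrumentedSeconds) {
        if (nativeSeconds.length == 0 || nativeSeconds.length != instrumentedSeconds.length) {
            throw new IllegalArgumentException("need matching, non-empty timing arrays");
        }
        double logSum = 0.0;
        for (int i = 0; i < nativeSeconds.length; i++) {
            // Summing logarithms avoids overflow when multiplying many ratios.
            logSum += Math.log(instrumentedSeconds[i] / nativeSeconds[i]);
        }
        return Math.exp(logSum / nativeSeconds.length) - 1.0;
    }

    public static void main(String[] args) {
        // Placeholder timings for illustration only.
        double[] nativeRun = {10.0, 20.0, 5.0};
        double[] instrumentedRun = {11.0, 25.0, 5.5};
        System.out.printf("overhead = %.1f%%%n", 100.0 * overhead(nativeRun, instrumentedRun));
    }
}
```

Unlike an arithmetic mean of slowdowns, the geometric mean is not dominated by a single benchmark with an unusually large ratio.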
MAMBO-X64.
The introduced ARM AArch64 architecture is a 64-bit execution mode with a new instruction set which retains binary compatibility with the ARMv7 32-bit execution mode. Due to the need to support the large number of existing 32-bit ARM applications, current implementations of AArch64 processors include hardware support for ARMv7. However, this support comes at a cost in hardware complexity, power usage, and verification time.
MAMBO-X64 is a dynamic binary translator which executes 32-bit ARM binaries (both single-threaded and multi-threaded) using the AArch64 instruction set. The integration of MAMBO-X64 into Beehive creates a path for experimentation for future processors to drop hardware support for the legacy 32-bit instruction set while retaining the ability to run ARMv7 applications.
In the context of Beehive, the performance of MAMBO-X64 has been further improved by employing a number of novel optimizations such as: mapping ARMv7 floating-point registers to AArch64 registers dynamically, generating traces that harness hardware return address prediction, and efficiently handling operating system signals. After applying the aforementioned optimizations, we measured on SPEC CPU2006 [56] a very low geometric mean performance overhead of 0.2%, 3.3%, and 8.3% on X-Gene, Cortex-A57, and Cortex-A53 processors respectively. The performance of MAMBO-X64 also scales to multi-threaded applications, with an overhead on the PARSEC [10] multi-threaded benchmark suite of only 2.1% with 1, 2, and 4 threads, and 4.9% with 8 threads.
Hardware/FPGA Layer
As depicted in Figure 1, Project Beehive targets a variety of hardware platforms and therefore significant effort is being placed in providing the appropriate support for the compilers and runtimes of choice. Besides targeting conventional CPU/GPU systems, it is also possible to target FPGA systems such as the Xilinx Zynq ARM/FPGA System on Chip (SoC).
In order to efficiently program FPGAs from high-level programming languages, we developed MAST: a Modular Acceleration and Simulation Technology. MAST consists of a hardware/software library and tools allowing the rapid development of systems using ARM-based FPGAs. From the hardware perspective, it consists of a standardized interface which allows IP blocks to be identified and locked for use by processes running on the ARM processor. All IP blocks feature an AXI slave port, used for configuration and low-speed communication, and optionally an AXI master port to provide high-speed access to the system memory of the ARM processor, typically via the ACP port to provide cache coherency. Currently hardware design is carried out using Bluespec System Verilog [8], with interface modules conforming to the hardware. The software library, which is entirely in user space, provides a hardware manager which can be used to discover IP on the programmable logic and allocate it to a specific process thread. The software library also provides a simple interface to IP blocks that bridges the virtual memory world of the processor and the physical memory required by the hardware, where either the library or the host application can perform memory allocation.
Simulation Layer
Besides running directly on real hardware, Beehive offers the opportunity to conduct micro-architectural research via its advanced simulation infrastructure. The two simulators of choice, with diverse characteristics, ported to the Beehive platform are Gem5 [11] and ZSim [52]. While ZSim offers fast and highly accurate simulation on x86 (≈ 10 MIPS in our experiments), Gem5 provides a slower yet more detailed full-system simulation framework for numerous architectures.

2.6.1 Gem5. The Gem5 full-system simulator has been adopted and augmented in the following ways:
• Integration with other architectural simulators:
A new interface layer has been developed within the Gem5 full-system simulator [12] to facilitate easy integration with a range of architectural simulators, as depicted in Figure 7. The statistics package has been augmented to allow statistics to be assigned to groups, specified at run-time and manipulated (output and reset) independently, without affecting the total values of the statistics or requiring updates to the code base. This allows new architectural simulators to be invoked from within the Gem5 simulator by using standard C++ template code. Current simulators integrated into the Gem5 framework include:
1) McPAT [40] and Hotspot [33]: The power and temperature modelers provided by those tools are conjoined to provide accurate temperature-based leakage models. Power samples may be triggered from within the Gem5 simulator, at intervals from 10ns to 10us (allowing transient traces to be generated for benchmarks), and from within the simulated OS (allowing accurate power and temperature figures to be used within user space programs). There is around a 10% simulation time overhead for temperature and power modelling with 10us samples.
2) Voltspot [66]: In order to measure voltage noise events caused by power-gating or switching patterns in multi-core SoCs over realistic workloads, the Voltspot simulator has been incorporated into the framework. The additional statistics generated allow nanosecond timing of events to be recorded while using samples of coarser granularity.
3) NVSim [23]: The non-volatile memory simulator NVSim has been incorporated into the simulation infrastructure. NVSim can be invoked by McPAT (alongside the conventional SRAM modeling tool Cacti [42]), allowing accurate delay, power, and temperature modeling of non-volatile memory anywhere in the memory hierarchy.
• Machine Learning and Data Analytics techniques:
The interface layer has also been used to allow machine-learning/data-analytics techniques to be incorporated within the simulation framework. Machine-learning techniques are used to analyze statistical patterns in the data, aiding in the creation of hardware predictors for power management, prefetching, branch prediction, etc. The statistics package allows for the specification of features at run-time. Features are defined as a statistic over a given period (e.g. the branch mispredict rate over 1us, or the L2 cache miss-rate over 10ms). Features are specified at run-time and can be accessed periodically or triggered from events within the simulator, and the statistics package guarantees to return the features over their specified time (within an error range which is also set at run-time).
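A feature of this kind, a rate computed over a trailing time window, might be sketched as follows. This is a simplified illustration and not the Gem5 statistics package: the class `WindowedRate`, its methods, and the nanosecond time unit are assumptions made for the example.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of a "statistic over a given period" feature.
final class WindowedRate {
    private final long windowNs;
    // Each sample is {timeNs, eventCount, hitCount}.
    private final ArrayDeque<long[]> samples = new ArrayDeque<>();
    private long events;
    private long hits;

    WindowedRate(long windowNs) {
        this.windowNs = windowNs;
    }

    // Records e.g. (branches executed, branches mispredicted) observed at timeNs.
    void record(long timeNs, long eventCount, long hitCount) {
        samples.addLast(new long[] {timeNs, eventCount, hitCount});
        events += eventCount;
        hits += hitCount;
        // Drop samples that have fallen out of the trailing window.
        while (!samples.isEmpty() && samples.peekFirst()[0] <= timeNs - windowNs) {
            long[] old = samples.removeFirst();
            events -= old[1];
            hits -= old[2];
        }
    }

    // The feature value: hits per event within the current window.
    double rate() {
        return events == 0 ? 0.0 : (double) hits / events;
    }

    public static void main(String[] args) {
        WindowedRate mispredictRate = new WindowedRate(1_000); // 1us window
        mispredictRate.record(0, 100, 10);
        mispredictRate.record(500, 100, 30);
        System.out.println(mispredictRate.rate());
        mispredictRate.record(1_200, 100, 0); // the sample at t=0 expires
        System.out.println(mispredictRate.rate());
    }
}
```

Keeping running totals means each query is constant-time, which matters when a predictor reads the feature frequently during simulation.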
The FEAST toolkit [15] has been incorporated into the framework (Figure 7) to allow for (offline) feature selection. Packages for online K-nearest neighbour (KNN) and Support Vector Machine regression have been incorporated into the framework to allow for online prediction once the features have been chosen. Interaction between the simulator and the predictors is controlled by the statistics package, again allowing for the prediction to be triggered within the Gem5 simulator code or from within the simulated OS.
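To make the online KNN regression step concrete, a minimal sketch is given below. It is a textbook brute-force formulation with hypothetical names (`OnlineKnnRegressor`, `observe`, `predict`); the packages actually integrated into the framework are not described in enough detail here to reproduce, so nothing in this sketch should be read as their interface.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of online K-nearest-neighbour regression (brute force).
final class OnlineKnnRegressor {
    private final int k;
    private final List<double[]> features = new ArrayList<>();
    private final List<Double> targets = new ArrayList<>();

    OnlineKnnRegressor(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        this.k = k;
    }

    // Online learning: each new (feature vector, observed value) pair is stored.
    void observe(double[] x, double y) {
        features.add(x.clone());
        targets.add(y);
    }

    // Predicts the mean target of the k stored samples closest to the query.
    double predict(double[] query) {
        if (features.isEmpty()) {
            throw new IllegalStateException("no samples observed");
        }
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < features.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble(i -> squaredDistance(features.get(i), query)));
        int n = Math.min(k, order.size());
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += targets.get(order.get(i));
        }
        return sum / n;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
        }
        return d;
    }

    public static void main(String[] args) {
        OnlineKnnRegressor knn = new OnlineKnnRegressor(2);
        knn.observe(new double[] {0.0}, 1.0);
        knn.observe(new double[] {1.0}, 3.0);
        knn.observe(new double[] {10.0}, 100.0);
        System.out.println(knn.predict(new double[] {0.4}));
    }
}
```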
• Resiliency and Fault-Injection: A critical aspect of any computer system is its dependability evaluation [37, 39, 58]. The accurate identification of vulnerabilities helps computer architects to plan carefully for low-cost and highly energy-efficient resiliency mechanisms. Conversely, inaccurate dependability assessment often results in over-designed microprocessors, negatively impacting time-to-market and product costs. To aid dependability studies, we developed a fault injection framework that adheres to the following principles: 1) Flexibility: fault injection experiments are easy to set up, define, and perform, 2) Reproducibility: experiments are reproducible, 3) Generality: a wide set of ISAs is supported in a uniform way, enabling comparative studies, and 4) Scalability: the framework is easily deployed to multi-core designs. Figure 8 depicts the floor-plan of the fault injection tool. The developed fault injection framework is built on top of Gem5 and operates as follows: A user-defined test scenario is translated into a set of fault injection arguments using a simulator-specific API. The injection library implements all the necessary simulation calls: (i) fault model(): sets up a transient, intermittent, or permanent fault model [13, 21, 44]. Transient faults are modeled by flipping the value of a randomly selected bit in a randomly selected time window within the simulation. Intermittent faults are modeled by setting the state of storage elements to one (stuck-at-1) or zero (stuck-at-0), in a randomly selected time window, for a random period. Permanent faults set the state of a storage element persistently to one or to zero. Finally, multi-bit fault injections, combining the aforementioned models, are also supported. (ii) apply(): injects the faults into a user-defined location (e.g. L1, L2 cache, etc.); and (iii) monitor(): logs and clusters the fault injection output.
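The three fault models just described can be expressed as simple bit operations on a stored word. The sketch below is illustrative only: the class and method names are hypothetical and do not correspond to the framework's injection library, and the seeded random generator is an assumption about how reproducible bit selection could be achieved.

```java
import java.util.Random;

// Hypothetical sketch of the transient, intermittent, and permanent fault models.
final class FaultModelSketch {
    // Transient fault: flip one bit of the stored word once.
    static long flipBit(long word, int bit) {
        return word ^ (1L << bit);
    }

    // Permanent fault: the bit is forced to one (stuck-at-1) or zero (stuck-at-0).
    static long stuckAt(long word, int bit, boolean one) {
        return one ? (word | (1L << bit)) : (word & ~(1L << bit));
    }

    // Intermittent fault: the bit is stuck only while start <= cycle < start + duration.
    static long intermittent(long word, int bit, boolean one,
                             long start, long duration, long cycle) {
        boolean active = cycle >= start && cycle < start + duration;
        return active ? stuckAt(word, bit, one) : word;
    }

    public static void main(String[] args) {
        // A fixed seed keeps the randomly selected bit reproducible across runs.
        Random rng = new Random(42);
        int bit = rng.nextInt(64);
        long word = 0b1010L;
        System.out.println(Long.toBinaryString(flipBit(word, bit)));
        System.out.println(Long.toBinaryString(stuckAt(word, 0, true)));
        System.out.println(Long.toBinaryString(intermittent(word, 1, false, 10, 5, 12)));
    }
}
```

A multi-bit injection would apply several of these operations to the same word, one per selected bit.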
Finally, the injection controller, the kernel of the framework, communicates with the injection library and orchestrates the actual fault injection based on the user-defined arguments.

2.6.2 ZSim. The ZSim simulator, a user-level x86-64 simulator with an OOO-core model of the Westmere (Nehalem) micro-architecture, has been augmented in order to run managed workloads on MaxineVM, resulting in the MaxSim simulation platform [50]. Alternative options such as the Sniper [18] simulator that runs with JikesRVM [54], or the full-system Gem5 simulator, were considered but abandoned due to a number of limitations: Sniper can only run in 32-bit mode, while Gem5 has a relatively low simulation speed. Finally, in order to perform energy and power estimations, we integrated the McPAT [41] tool into the ZSim simulator following the methodology proposed by the Sniper simulator [32]. The methodology necessitated the implementation of a number of extra microarchitectural events in ZSim, such as the number of predicted branches and floating point micro-operations.
SLAM APPLICATIONS
3.1 KinectFusion
To showcase the capabilities of the Beehive platform, we focused on an emerging class of applications that is becoming significant in both the desktop and mobile domains: real-time 3D scene understanding in Computer Vision. In particular, we investigate SLAMBench, a complex Simultaneous Localization and Mapping (SLAM) application which implements the KinectFusion (KFusion) algorithm. SLAM applications are challenging due to the amount of computation needed per frame and the programming complexity of achieving high-performing implementations. SLAMBench allows the reconstruction of a three-dimensional representation from a stream of depth images produced by an RGB-D camera (Figure 9), such as the Microsoft Kinect. Typically, the slower the frames are processed, the harder it is to build an accurate model of the scene. Each of the depth images is used as input to the six-stage processing pipeline shown in Figure 10:
• Acquisition obtains the next RGB-D frame, either from a camera or from a file.
• Pre-processing cleans up the incoming data using a bilateral filter and standardizes the units used for measurement.
• Tracking estimates the new pose of the camera; it builds a point cloud from the current data frame and matches it against a reference point cloud, produced by the raycasting step, using an iterative closest point (ICP) algorithm.
• Integrate fuses the current frame into the internal model, if a new pose has been estimated.
• Raycast constructs a new reference point cloud from the internal representation of the scene using raycasting.
• Rendering uses the same raycasting technique to visualize the 3D scene.
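The data flow between the six stages listed above, including the reuse of the raycast output as the reference for tracking the next frame, can be sketched as a simple driver loop. This is an illustrative skeleton only: every stage is a caller-supplied callable with a hypothetical signature, and `track` is assumed to return `None` when no new pose can be estimated.

```python
def run_pipeline(frames, preprocess, track, integrate, raycast, render,
                 model, pose):
    """Drive the six stages over a stream of frames.

    All stage callables are placeholders supplied by the caller; their
    signatures are assumptions made for this sketch.
    """
    reference = None      # no reference point cloud before the first raycast
    images = []
    for raw in frames:                               # Acquisition
        depth = preprocess(raw)                      # Pre-processing
        new_pose = track(depth, reference, pose)     # Tracking (ICP)
        if new_pose is not None:
            pose = new_pose
            model = integrate(model, depth, pose)    # Integrate
        reference = raycast(model, pose)             # Raycast: feeds the next Tracking
        images.append(render(model, pose))           # Rendering
    return model, pose, images
```

The only state carried from one iteration to the next is the model, the pose, and the reference point cloud, which is what couples consecutive frames.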
It should be noted that the pipeline has a feedback loop. Each of the pipeline stages is composed of a number of different kernels. In the original KinectFusion implementation, a kernel represents a separate region of code which is executed on the GPU. In a typical pipeline execution, KinectFusion will execute between 18 and 54 kernels (best and worst case scenarios). The variation depends on the performance of the ICP algorithm: if it is able to estimate the new camera pose quickly, then fewer kernels will be executed. This means that to achieve a real-time performance of 30 frames per second, the application needs to sustain the execution of between 540 and 1620 kernels every second.
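The kernel rates quoted above follow directly from the per-frame kernel counts and the frame-rate target. The short sketch below reproduces that arithmetic and, under the added assumption that kernels execute strictly one after another, derives the average time available per kernel; the function names are illustrative.

```python
def kernels_per_second(kernels_per_frame, fps):
    """Kernel launches that must be sustained to hold the frame rate."""
    return kernels_per_frame * fps


def mean_kernel_budget_ms(kernels_per_frame, fps):
    """Average time per kernel in milliseconds, assuming sequential execution."""
    return 1000.0 / kernels_per_second(kernels_per_frame, fps)


TARGET_FPS = 30
for count in (18, 54):  # best and worst case kernels per frame
    print(count, kernels_per_second(count, TARGET_FPS),
          round(mean_kernel_budget_ms(count, TARGET_FPS), 3))
```

In the worst case this leaves well under one millisecond per kernel on average, including any launch and data-transfer overhead.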
Programmability Vs. Performance
SLAMBench offers baseline and high-performing implementations of KinectFusion in C++, OpenMP, CUDA, and OpenCL. In order to achieve the QoS targets of Computer Vision (typically over 30 FPS), KinectFusion has to be heavily parallelized on GPGPUs, and therefore the CUDA and OpenCL implementations are the ones that meet the required targets. Developing in CUDA or OpenCL, however, comes with a number of drawbacks. The first is code complexity and reduced productivity, while the second is portability, since applications have to be recompiled and tuned for each target hardware platform.
To tackle the aforementioned problems and to showcase the capabilities of Beehive, we decided to experiment with Computer Vision applications in Java, a language that until now has not been considered for such high-performing and demanding applications. Implementing SLAMBench, and Computer Vision applications in general, in Java provides a trade-off between programmability effort and performance.
While Java enables rapid prototyping, in contrast to writing OpenCL or CUDA, vanilla and unoptimized implementations cannot meet the QoS requirements. We use the Java programming language as a challenge in order to build and optimize Computer Vision applications aiming to achieve real-time 3D space reconstruction. After developing and validating a serial implementation of SLAMBench, we performed a performance analysis and identified the performance bottlenecks. Then, we utilized Beehive to apply a number of co-designed acceleration and optimization techniques to the various stages of SLAMBench. The acceleration techniques span from custom FPGA acceleration of certain kernels to full-application acceleration through co-designed object compaction and GPGPU off-loading.
EVALUATION
The following subsections describe the acceleration and optimization techniques applied to SLAMBench via the Beehive platform, along with the experimental results. The hardware and software configurations for each optimization are presented in Table 1.
GPU Acceleration
GPU acceleration has been applied to SLAMBench through Tornado (Section 2.3.3). All kernels of KinectFusion but one (Acquisition cannot be accelerated because the input is obtained serially from a camera or a file) have been dynamically compiled and offloaded for GPGPU execution through OpenCL code emission. Figures 11 and 12 illustrate the performance and speedup of the accelerated KinectFusion version, respectively.
As depicted in Figure 11, the original validated version of KinectFusion cannot meet the QoS target of real-time Computer Vision applications (0.71 FPS on average). Both the serial Java and C++ versions perform under 3 FPS, with the C++ version being 3.3x faster than Java. By accelerating KinectFusion through GPGPU execution, we manage to achieve a constant rate of over 30 FPS (31.07 FPS) across all frames (802) from the ICL-NUIM dataset [30] (Room 2 configuration). In order to achieve 30 FPS, all kernels have been accelerated by up to 861.26x, with an average of 43.37x across the whole application, as depicted in Figure 12. By utilizing Beehive and its GPU acceleration infrastructure, we manage to accelerate a simple, unoptimized serial Java version of a KinectFusion algorithm to meet its QoS requirements in a manner transparent to the developer.
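As a rough sanity check on the figures above, the sketch below converts frame rates into per-frame time budgets and computes the whole-application speedup needed to move from the 0.71 FPS baseline to the 30 FPS target. It relates averages only; per-kernel and per-frame speedups vary, and the helper names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available to process a single frame, in milliseconds."""
    return 1000.0 / fps


def required_speedup(baseline_fps, target_fps):
    """Whole-application speedup needed to move from baseline to target."""
    return target_fps / baseline_fps


BASELINE_FPS = 0.71   # average rate of the original validated version
TARGET_FPS = 30.0     # real-time QoS target

print(round(frame_budget_ms(BASELINE_FPS), 1))   # time per frame before acceleration
print(round(frame_budget_ms(TARGET_FPS), 1))     # time per frame at the target rate
print(round(required_speedup(BASELINE_FPS, TARGET_FPS), 2))
```

The required factor of roughly 42x is in line with the reported average application speedup.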
FPGA Acceleration
FPGA acceleration has been applied to SLAMBench through the MAST acceleration functionality of Beehive (Section 2.5).
In the context of our initial investigation into FPGA acceleration, we selected the pre-processing stage, which contains two computational kernels that: i) scale the depth camera image from millimeters to meters, and ii) apply a bilateral filter to produce a filtered scaled image. The filter is applied to the scaled image in order to reduce the effects of noise in the depth camera measurements. This includes missing or invalid values due to the characteristics of the 3D space 7.
In order to improve the execution time in Java, we merged the two routines into a single routine, reducing the streaming of data to and from the FPGA device. The offloading to the FPGA is accomplished by using the Java Native Interface (JNI) mechanism to interface with our MAST module (Section 2.5).
The JNI stub extracts C-arrays of floating point values from the Java environment that represent the current input raw depth image from the camera and the current output scaled filtered image. The JNI stub, in turn, converts the current raw depth image into an array of short integers, which is allocated (through malloc) on the first execution of the JNI stub. The FPGA hardware environment is also initialized during the first execution, and the hardware then performs the merged scaling and filtering operation. Subsequent executions only need to extract the C-arrays and, finally, release the output scaled and filtered image array back to the Java environment. As depicted in Table 2, FPGA acceleration improves performance by 43x and 22x on MaxineVM and OpenJDK, respectively. The differences in execution times and speedups between the two VMs stem from the fact that OpenJDK produces better-optimized code than MaxineVM (Section 2.3).
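To make the merged pre-processing routine concrete, the following pure-Python sketch performs the millimeter-to-meter scaling and a bilateral filter in a single pass over the depth image. It is a software illustration under stated assumptions, not the FPGA design or the SLAMBench code: the window radius, the two sigma parameters, and the convention that a zero depth marks an invalid pixel are all choices made for this example.

```python
import math


def scale_and_filter(depth_mm, radius=1, sigma_space=1.0, sigma_range=0.1):
    """Convert depth from millimeters to meters and bilateral-filter it.

    `depth_mm` is a list of rows of integer depths. A value of 0 is
    treated as an invalid measurement (an assumption of this sketch).
    """
    height, width = len(depth_mm), len(depth_mm[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            centre = depth_mm[y][x] / 1000.0
            if centre == 0.0:
                continue  # invalid pixels stay invalid
            total = 0.0
            weight_sum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue
                    sample = depth_mm[ny][nx] / 1000.0
                    if sample == 0.0:
                        continue  # ignore invalid neighbours
                    # Spatial weight: nearby pixels count more.
                    w_space = math.exp(-(dx * dx + dy * dy)
                                       / (2.0 * sigma_space ** 2))
                    # Range weight: pixels at a similar depth count more,
                    # which preserves depth discontinuities.
                    w_range = math.exp(-((sample - centre) ** 2)
                                       / (2.0 * sigma_range ** 2))
                    total += w_space * w_range * sample
                    weight_sum += w_space * w_range
            out[y][x] = total / weight_sum  # centre pixel guarantees weight_sum >= 1
    return out
```

Fusing the two steps means the scaled image is never materialized separately, which mirrors the motivation given above for reducing the data streamed to and from the device.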
HW/SW Co-Designed Object Compaction
This generic optimization applies to all Java objects and eliminates class information from object headers. This is achieved by utilizing tagged pointers, a feature currently supported by ARM AArch64 [26] and SPARC M7 [55]. Applying this optimization requires changes at both the Virtual Machine and the hardware layers. In our case, it has been applied to SLAMBench through the Maxine/ZSim stack [51] (Section 2.6.2).
Object-oriented programming languages have the fundamental property of associating type information with objects, allowing them to perform various tasks such as virtual dispatch, introspection, and reflection. Typically, this is performed by maintaining an extra pointer per object to its associated type information. To save that extra heap space per object, we utilize tagged pointers in order to encode class information inside object addresses. By extending ZSim to support tagged pointers in x86 and by extending the Address Generation Unit (AGU) at the micro-architectural level, we managed to expose tagged addresses at the JVM level. Instead of maintaining the extra pointer per object, we exploit the unused bits of tagged pointers to encode that information. The proposed optimization, which is orthogonal to any application running on top of the JVM, has been applied to SLAMBench, and the results are shown in Figures 13 and 14.
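The idea of encoding class information in the unused bits of an address can be sketched with plain integer arithmetic. The layout below (48 address bits with a 16-bit class identifier in the upper bits of a 64-bit word) is an assumption chosen for illustration, not the layout used in the Maxine/ZSim implementation, and the function names are hypothetical.

```python
ADDRESS_BITS = 48                      # assumed width of a usable virtual address
TAG_BITS = 16                          # assumed number of spare upper bits
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def tag_pointer(address, class_id):
    """Pack a class identifier into the unused upper bits of an address."""
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address does not fit in the address bits")
    if not 0 <= class_id < (1 << TAG_BITS):
        raise ValueError("class id does not fit in the tag bits")
    return (class_id << ADDRESS_BITS) | address


def class_id_of(pointer):
    """Recover the class identifier without touching the object header."""
    return pointer >> ADDRESS_BITS


def address_of(pointer):
    """Strip the tag to obtain the address used for the memory access."""
    return pointer & ADDRESS_MASK
```

In this sketch the stripping is done in software by `address_of`; in the co-designed setup described above, that role falls to the extended AGU, so the object header no longer needs a separate class pointer.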
As depicted in Figure 13, by employing the co-designed optimization for eliminating class information from object headers, we achieved up to 1.32x speedup, with an average of 1.10x across all stages of SLAMBench. Furthermore, as depicted in Figure 14, the optimization resulted in reductions of up to 27% in dynamic DRAM energy, 12% in total DRAM energy, and 5% in total dynamic energy. The energy reductions correlate with improvements in cache utilization of 24% and 25% in the L2 and L3 caches, respectively. The observed benefits of the introduced optimization derive from the fact that, by compressing object sizes by one word, we managed to: 1) improve cache utilization, 2) reduce garbage collection invocations (from 10 to 7) due to heap savings, and 3) improve retrieval time for class information due to the introduced minimal hardware extension.
RELATED WORK
Although heterogeneity is the dominant design approach, its programming environment is extremely challenging. Delite [16, 19] is a compiler and runtime framework for parallel embedded domain-specific languages [60, 61]. Its goal is to facilitate heterogeneous programming to efficiently exploit the underlying heterogeneous hardware capabilities. SWAT [29] is a software platform that enables native execution of Spark applications on heterogeneous hardware. Furthermore, OpenPiton [9] is an open-source many-core research framework covering only the hardware layer, X-Mem [28] is an open-source software tool that characterizes the memory hierarchy for cloud computing, and Minerva [49] is a HW/SW co-designed framework for deep neural networks. In contrast to the aforementioned approaches, the Beehive framework is a hardware/software experimentation platform that enables co-designed optimizations for runtime and architectural research, covering all applications and the entire compute stack. Regarding GPGPU Java acceleration, a number of approaches exist, such as APARAPI [4], Ishizaki et al. [35], Rootbeer [48], and Habanero-Java [31]. Beehive's Tornado module differs due to its dynamic nature and its co-operation with other parts of the framework such as MAST.
CONCLUSIONS AND FUTURE WORK
In this paper, we introduced Beehive: a hardware/software co-designed platform for full-system runtime and architectural research. Beehive builds on top of existing state-of-the-art as well as novel components at all layers of the platform.
By utilizing Beehive, we managed to accelerate a complex Computer Vision application in three distinct ways: GPGPU acceleration, FPGA acceleration, and by compacting objects in a hardware/software co-designed manner. The experimental results showed that we achieved real-time 3D space reconstruction (>30 FPS) of the KFusion application, after accelerating it by up to 43×. Our vision regarding Beehive is to improve both its integration and performance throughout all the layers. In the long term, we aim to unify the platform's components under a semantically aware runtime, increasing developer productivity. Furthermore, we plan to define a hybrid ISA between emulated and hardware capabilities. This ISA will provide a roadmap of movement of interactions between abstractions offered in software and in hardware. Finally, we plan to work on new hardware services for scale out and representation of volatile and non-volatile communication services. This will provide a consistent view of platform capabilities across heterogeneous processors for Big Data and HPC applications.
