Abstract. Mobile processors, a subclass of embedded processors, are increasingly employing multicore designs to improve performance. This often requires sacrificing resources in each CPU, degrading single-thread performance, which remains important according to Amdahl's law. The traditional technique for efficiently boosting serial performance in embedded processors, dedicated hardware acceleration, is unsuitable for modern mobile processors because of the heterogeneity and diversity of the applications they run. This paper proposes 'general purpose' accelerators, reconfigured on an application-by-application basis, as a means of increasing single-thread performance. These accelerators are placed within the datapath of CPUs and support dynamic compilation. This paper presents the design of an architecture with such accelerators and evaluates the cost/performance implications of the design.
Introduction
Mobile processors, a subclass of embedded processors, are General Purpose Processors (GPPs) designed primarily for small, fan-less, battery-powered mobile computing devices such as smartphones. They are characterised by high performance, low energy consumption, small area and low cost. Mobile processors are increasingly moving to multicore designs to improve performance.
Multicore processors, which place multiple Central Processing Units (CPUs) on a die, improve performance by handling more work in parallel. Increasing the number of CPUs often requires sacrificing resources in each CPU, which degrades single-thread performance. Single-thread performance is still important because some key applications have limited Thread-level Parallelism (TLP). Further, according to Amdahl's law [6], the serial sections of even a massively parallel application with abundant TLP constrain its overall performance. A current (and future) challenge for mobile processor vendors is how to efficiently increase single-thread performance in these resource-constrained processors. Unfortunately, the time-tested approach to improving serial performance in embedded processors (accelerating compute-intensive parts of applications using dedicated hardware) is unsuitable for modern mobile processors because of the heterogeneity and diversity of the applications they run. The next best alternative is 'general purpose' hardware, reconfigured on an application-by-application basis to realise frequently occurring functions. This is less efficient in terms of area, cost and power than fixed hardware, but it allows a GPP to be specialised for the application it is currently running.
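The role of Amdahl's law in this argument can be made concrete with a small sketch. The parallel fraction, core counts and serial speedup used here are hypothetical values chosen for illustration, not measurements from this paper, and the function names are our own.

```python
def amdahl_speedup(parallel_fraction: float, cores: int,
                   serial_speedup: float = 1.0) -> float:
    """Overall speedup of a fixed workload under Amdahl's law.

    parallel_fraction: share of the original run time that parallelises.
    cores:             number of CPUs sharing the parallel part.
    serial_speedup:    hypothetical factor by which the serial part alone
                       is accelerated (1.0 means no acceleration).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1 or serial_speedup <= 0.0:
        raise ValueError("cores must be >= 1 and serial_speedup > 0")
    serial_fraction = 1.0 - parallel_fraction
    # New run time, normalised to an original run time of 1.0.
    new_time = serial_fraction / serial_speedup + parallel_fraction / cores
    return 1.0 / new_time


if __name__ == "__main__":
    p = 0.9  # assumed: 90% of the run time is parallelisable
    for n in (4, 8, 64):
        plain = amdahl_speedup(p, n)
        boosted = amdahl_speedup(p, n, serial_speedup=2.0)
        print(f"{n:3d} cores: {plain:.2f}x, "
              f"with 2x faster serial code: {boosted:.2f}x")
```

Under these assumptions the speedup without serial acceleration can never exceed 10×, however many cores are added, because the 10% serial share dominates. Doubling serial performance raises that ceiling to 20×, which is the motivation for improving single-thread performance even on a multicore design.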
Reconfigurable hardware has been used successfully to accelerate single threads in experimental and commercial processors. However, employing it in multicore mobile processors poses two unique challenges.
First, a typical Reconfigurable Architecture (RA) is composed of several memory elements, programmable interconnect and a large array of Processing Elements (PEs); its significant area and power consumption make deployment in a mobile processor prohibitive. Second, mobile processors rely extensively on dynamic compilation to improve portability, and dynamic compilation is not yet common on RAs. Dynamic compilation is important because an increasing number of parallel programming systems rely on it to provide forward scaling [13]: applications scale effectively with new core counts as well as with the unavoidable augmentation and evolution of the instruction set. For instance, kernels (critical parallel functions) in Intel® Array Building Blocks (IABB) [13] are first compiled to a platform-independent Intermediate Representation (IR) and then dynamically compiled to binary by a Virtual Machine (VM) at run-time.
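The idea of forward scaling through dynamic compilation can be illustrated with a toy sketch. Everything in it is hypothetical: the tuple-based IR, the `mac` (multiply-accumulate) operation and the `has_fused_mac` flag are invented for this example and do not describe IABB or any real instruction set. The point is only that one platform-independent IR can be lowered at run time to whatever operations the current target offers.

```python
# Toy platform-independent IR: (opcode, dest, src1, src2).
IR = [
    ("mul", "t0", "a", "b"),
    ("add", "y", "t0", "c"),
]


def lower(ir, has_fused_mac):
    """Translate IR to target operations at run time.

    If the (hypothetical) target offers a fused multiply-accumulate, a
    mul followed by an add that consumes its result becomes one mac.
    Assumes the temporary produced by the mul has no other use.
    """
    out, i = [], 0
    while i < len(ir):
        op = ir[i]
        nxt = ir[i + 1] if i + 1 < len(ir) else None
        if (has_fused_mac and nxt is not None and op[0] == "mul"
                and nxt[0] == "add" and nxt[2] == op[1]):
            # mac dest, x, y, z computes x * y + z in one operation.
            out.append(("mac", nxt[1], op[2], op[3], nxt[3]))
            i += 2
        else:
            out.append(op)
            i += 1
    return out


def execute(ops, inputs):
    """Reference interpreter for the lowered operations."""
    regs = dict(inputs)
    for op in ops:
        if op[0] == "mul":
            regs[op[1]] = regs[op[2]] * regs[op[3]]
        elif op[0] == "add":
            regs[op[1]] = regs[op[2]] + regs[op[3]]
        elif op[0] == "mac":
            regs[op[1]] = regs[op[2]] * regs[op[3]] + regs[op[4]]
        else:
            raise ValueError(f"unknown opcode {op[0]!r}")
    return regs


if __name__ == "__main__":
    inputs = {"a": 3, "b": 4, "c": 5}
    for fused in (False, True):
        ops = lower(IR, has_fused_mac=fused)
        print(fused, len(ops), execute(ops, inputs)["y"])
```

Both lowerings compute the same result from the same IR, but the target with the fused operation needs one operation instead of two. An application distributed as IR therefore benefits from a new instruction without being recompiled offline.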
This paper presents the VIrtual REconfigurable Micro-ENgine for Translation (VIREMENT), a mobile multicore processor that employs general purpose accelerators to improve single-thread performance. Each general purpose accelerator is a Reconfigurable Functional Unit (RFU) placed within the datapath of a CPU. VIREMENT supports dynamic compilation by providing a run-time library for generating reconfigurable instructions on-the-fly. Experiments show an average performance improvement of 133% (a 2.33× speedup) with an area overhead of 34% per CPU.
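The relation between the two headline figures can be checked with a few lines of arithmetic. The sketch below converts the reported improvement into a speedup factor and, as a derived illustration that the paper does not itself report, divides it by the relative area to give a rough performance-per-area ratio. The function names are our own, and combining two averages this way is only indicative.

```python
def improvement_to_speedup(improvement_percent: float) -> float:
    """A 133% improvement means the new performance is 2.33x the old."""
    return 1.0 + improvement_percent / 100.0


def speedup_per_area(speedup: float, area_overhead_percent: float) -> float:
    """Speedup divided by relative area (1.0 is the unmodified CPU)."""
    return speedup / (1.0 + area_overhead_percent / 100.0)


speedup = improvement_to_speedup(133.0)     # figure reported above
density = speedup_per_area(speedup, 34.0)   # derived, not reported

if __name__ == "__main__":
    print(f"{speedup:.2f}x speedup, "
          f"{density:.2f}x performance per unit area")
```

A ratio above 1 indicates that, on average, the added area buys proportionally more single-thread performance than it costs.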
Related Work
Over the years, architectures that dynamically translate code to run on reconfigurable hardware have been developed. Such architectures eliminate dependencies on hardware features, letting hardware vendors significantly change features from one hardware generation to the next without breaking binary portability.
Warp [17] is a family of processors that automatically extract and compile kernels to a Field-Programmable Gate Array (FPGA). A typical Warp processor is a System on Chip (SoC) with a main processor for executing applications, a less powerful processor on which a lean FPGA compiler runs, a profiler and a custom FPGA. It translates binary sequences to hardware transparently by profiling the executing binary, detecting critical regions, decompiling them, synthesising them to hardware, placing and routing them onto the custom on-chip FPGA, and updating the binary to call the hardware the next time the region executes. However, its CAD algorithms, which run on a separate microprocessor, require significant resources as well as time to execute. The use of an FPGA limits it to a few
