Abstract: Performance and energy consumption are well-known constraints of modern embedded systems; thus, application analysis in the early stages of the development cycle is mandatory. However, the few available tools for evaluating the behaviour of an application across different architectures do not provide a complete solution for this task. In this context, this work presents a multi-architecture profiling tool for Android applications, which fully supports the ARM, MIPS, and x86 architectures. It provides a wide range of information per application, including energy consumption, execution time and other statistics. To that end, we have extended the Android Emulator QEMU and developed post-processing tools. As a case study, we have compared different architectures in terms of performance and energy consumption. Using the proposed tool, we show that, given a fixed energy budget, the number of applications that can be executed depends on how they were implemented, and that this number varies according to the processor.
Introduction
Embedded systems operate in constrained environments in which memory, storage, power supply and processing capability are limited. As they become more complex and gain more functionality, a number of applications must run concurrently in an environment that is optimised not for performance, but rather for energy consumption, which must be kept as low as possible to maintain an acceptable battery life. For this reason, profiling and monitoring tools are extremely important to help ensure that these requirements are met at earlier stages of the development process.
In the case of Android-based systems, applications may execute on several different devices. This adds an extra variable, since the application must be evaluated considering different target architectures. As will be shown, the behaviour of an application (in terms of performance and energy) may vary considerably when one changes the target architecture. Therefore, the software designer must be aware of it during the application's development. The only solution currently available is to test the application on several different real devices, which would imply a huge development time. In addition, new devices, with different specifications, are introduced to the market frequently. Hence, this approach is not adequate for today's ever-changing market, as the effort required to obtain statistics on these applications rises greatly.
In order to address the aforementioned issue, we propose a new profiling tool, which is able to cover a wide range of information about applications' execution, helping the developer to meet the application's requirements. We focus on Android, since it is the world's most popular mobile platform (IDC, 2013; Mawston, 2014). Android, a Linux-based operating system, is mainly used in smartphones and tablets, and it is also a promising candidate for other consumer electronics products, such as smart TVs, set-top boxes and smartwatches. Android uses a virtual machine, called Dalvik, to execute bytecodes from Java applications. It is also possible to execute native code through the Java Native Interface (JNI) or native activities.
In order to validate our tool, we investigated applications written in Java and those that use the JNI, measuring simulation time, number of executed instructions and so on. Moreover, as a case-study to illustrate how the proposed profiler can be used for design space exploration purposes, considering different project restrictions and available resources, we estimated energy consumption and performance for a set of benchmarks executing on different processor architectures with different organisations: ARM (Cortex-A8 and Cortex-A9) and Intel x86 (Atom and i7). We also use the proposed profiler to assess the impact of the Dalvik virtual machine on all officially supported Android architectures (i.e., ARM, MIPS and x86).
The remainder of this work is organised as follows. Section 2 discusses related work. Section 3 presents the implementation of the proposed tool. The benchmarks that are used in this work are discussed in Section 4. Sections 5 and 6 discuss experimental results, and Section 7 concludes this work.
Related work
Simulators are commonly used to mimic the expected behaviour of a given system or architecture. They usually are classified as instruction-level or cycle-accurate. Instruction-level simulators have, as their main advantage, the simulation speed. However, they usually produce a less accurate result. Cycle-accurate power simulators demand a complex description of the architecture, usually in RTL format, which is only available in later design stages, and require larger simulation times.
Several tools have been proposed to estimate performance and power consumption in early stages of the design cycle, at a higher level of abstraction. In Tiwari et al. (1994), the estimation is based on the average power consumption of processor instructions, obtained by experimentation. Other approaches (Dalal and Ravikumar, 2001; Choi and Chatterjee, 2001; Chen et al., 2001) propose estimators that, although also based on instruction-level simulation, consider specific power consumption models for the architectural components (functional units, register files, buses, memories, control units) affected by each instruction, according to the switching activity at the inputs of the components and to their physical properties. Finally, there are approaches that combine both instruction-based and component-based estimation models (Givargis et al., 2002; Stitt et al., 2000). Šimunić et al. (1999) present a compiled-code cycle-accurate simulator that estimates the power consumption of an ARM-based system. Each system component (processor, memory, L2 cache, connections, and others) has its own power consumption model, based on the data used by the component in a given cycle. Landman and Rabaey (1996) present power estimation techniques to calculate the power dissipation in datapaths, control units, memory, and connections, also in a cycle-accurate way. In a similar approach, Liu and Svensson (1994) present techniques to calculate the power consumption of a System on a Chip (SoC). Moreover, other approaches focus on specific parts of the system, as in Wilton and Jouppi (1996), where an enhanced cycle-accurate cache access model was used, based on technology parameters and capacitance values derived from the technology files and netlists.
The works discussed so far focus on specific components of a system (e.g., processors, ALUs, etc.). The following works, on the other hand, focus on estimation for the system as a whole. Economou et al. (2006) propose a method for modelling full-system power consumption for servers based on components and application workload, providing real-time power prediction. Kansal et al. (2010) propose power models based on power sensors from servers, to infer power consumption from resource usage at runtime on virtual machines. Dutta et al. (2008) present an energy meter design, through a hardware modification and a software driver, for systems that have a built-in switching regulator. Fan et al. (2007) traced different classes of applications on large collections of servers for approximately six months. A modelling framework was used to estimate the potential of power management schemes to reduce peak power and energy usage. Other authors have used Hardware Performance Counters to estimate the power dissipation of individual components, e.g., the CPU (Snowdon et al., 2009; Bellosa, 2000), CPU and memory (Contreras and Martonosi, 2005), disk (Zedlewski et al., 2003) or even the entire system (Bircher and John, 2007).
In the next subsections, we will present profiling tools specifically for the Android platform and compare their features with the proposed tool. We will consider the list of important features, discussed below, when analysing the related work.
1 Profile both Java and native code: As several applications use native code to increase the application's performance, either by implementing a part of the application in native code via JNI or native activities, or by using native libraries, it is mandatory that profiling tools support both kinds.
2 Profile separated by application: When developing an application, we are interested in getting data on the application that is being developed, and not of the entire system, which will have several other applications or background processes running at the same time.
3 Profile applications' hot spots: In order to guide the developer to changes that are needed to meet the application's requirements, it is also important to identify the application's hot spots (i.e., a method or sequence of code that takes too much time or spends too much energy).
4 Unlimited amount of profiled data: Being able to profile a large amount of information about the application is imperative, since an application may take a long time to be executed.
5 Estimate performance: Estimating the application's performance is critical in order to meet the application's requirements.
6 Estimate power dissipation and energy consumption: Estimating the application's energy consumption is also essential, considering the limited battery capacity of current embedded systems.
7 Provide graphical interfaces and support reuse of data: Providing a graphical interface benefits the analysis of the collected data, and the support for reusing data speeds up the development process.
In addition to all of the aforementioned features, the proposed tool is based on the Android emulator; therefore, several devices can be easily emulated without the need for real devices. As we will show in the next subsections, even though there are several profiling tools available for Android, none of them covers the whole set of needed information that the proposed tool is capable of covering.
Traceview
The Android Software Development Kit (SDK) includes software tools for debugging, profiling and monitoring. The Dalvik Debug Monitor Server (DDMS), which belongs to this kit, contains Traceview (Android, 2013c), a profiling tool that provides timeline and profile panels for a given application (feature 2) in a Graphical User Interface (GUI). Because it does not reuse collected data, it only partially supports feature 7. The timeline panel shows when each thread starts and stops execution, while the profile panel contains a summary of each method, with its name, child methods, the inclusive and exclusive execution times, and the number of calls to the method (features 3 and 5). The inclusive execution time of a method is the time spent in the method itself plus the time spent in all methods called from it. The exclusive execution time does not include the time spent in called methods. The trace can be started and ended either by changing the application's source code or by manually starting and stopping it as the application executes. This tool can be used to trace applications on several devices and architectures (features 8 and 9). However, it is very limited in the amount of data that it can trace. Because it stores all the traced data in a buffer, it depends on the amount of free RAM available in the mobile device. Occasionally, the buffer will overflow, and all data profiled after the overflow will be lost. This was the case for most of the tested benchmarks (which will be discussed later), even the smallest ones running on devices with more than 1GB of RAM available. Moreover, besides not providing any information on power consumption, Traceview can only trace methods of applications executed by the Dalvik Virtual Machine. Therefore, native code (through the JNI or Native Activities) cannot be profiled.
Considering that many libraries are written natively (e.g., WebKit, used by the most popular web browsers), support for native code is a must for the analysis process. Therefore, features 1, 4 and 6 are missing.
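The distinction between inclusive and exclusive execution times can be illustrated with a minimal sketch. The `Call` record, the method names and the timestamps below are hypothetical and are not taken from Traceview's trace format; the sketch only shows how the two metrics relate for a call tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Call:
    """One method invocation in a call tree (hypothetical record layout)."""
    name: str
    start: int  # timestamp at method entry
    end: int    # timestamp at method exit
    children: List["Call"] = field(default_factory=list)


def inclusive_time(call: Call) -> int:
    # Time spent in the method plus everything it calls.
    return call.end - call.start


def exclusive_time(call: Call) -> int:
    # Inclusive time minus the inclusive time of the direct callees.
    return inclusive_time(call) - sum(inclusive_time(c) for c in call.children)


# Illustrative call tree: render() calls decode() and blit().
decode = Call("decode", 10, 40)
blit = Call("blit", 50, 70)
render = Call("render", 0, 100, [decode, blit])
```

Here `render` has an inclusive time of 100 time units but an exclusive time of 50, since 30 units are spent in `decode` and 20 in `blit`.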
Modification of the DDMS
The DDMS, discussed above, is written in Java and runs as an Eclipse plugin, so it is not performance oriented. With that in mind, the tool proposed by Yoon (2012) has the goal of speeding up the whole profiling process. This is done by decomposing Traceview into two parts: a log data processing layer and a Pretrace program layer, which creates and analyses the start and end times of methods. However, no quantitative information about the achieved speed-up is given. Moreover, as it is based on Traceview, it has the same limitations previously mentioned.
AndroScope
Cho et al. (2012) propose another performance analysis tool for the Android platform, called AndroScope. In this work, a low-level performance analysis through the Hardware Performance Counters (HPCs) of the processor is used to gather data on cache misses, CPU cycles and executed instructions. With this information, it is possible to obtain the Instructions Per Cycle (IPC) ratio, which is used as a performance factor (feature 5). The tool can trace both Java and native applications (feature 1), including Android libraries. However, an extended GCC compiler front-end must be used to automatically insert instrumentation code to obtain the trace of native libraries and to provide runtime filtering by class, method name or signature, which allows a selective trace (features 2 and 3). The authors also developed a graphical user interface, based on Traceview, and they created a new layer to process massive trace logs (partial support for feature 7, and feature 4).
This tool has two main drawbacks. The first one is that the application code must be modified prior to its analysis, so all native libraries must be recompiled. The second drawback comes from the fact that using HPC registers dictates the use of real processors to collect data, making the tool highly dependent on the architecture and processor and precluding its use for evaluating an application across multiple architectures. In this case, the Android Emulator could not be used, since the HPCs are not implemented. Hence, features 6, 8 and 9 are not supported. In addition, although it is not among the listed features, it is important to avoid modifying the application's code in order to profile it.
CPU cycle estimation
Research conducted at Fujitsu Laboratories Ltd. (Thach et al., 2012) proposed a methodology for an instruction-level CPU emulator. The methodology is divided into a two-phase pipeline scheduling process. First, a static phase is conducted to obtain a rough estimation of the CPU cycle count, with the purpose of reducing the instrumentation performance overhead. In this phase, the cycle cost of each basic block is estimated assuming that every memory access leads to a cache hit and that every branch is correctly predicted. Then, the dynamic phase is responsible for refining the results, adding penalty cycles for the cache misses and branch mispredictions (which are detected at runtime). This methodology was implemented by modifying the QEMU source code, and it achieves an estimation error in the CPU cycle count of about 10% when compared to a real CPU (features 5 and 2). For that, a cache simulator that models the L1 and L2 caches was built. The cache configuration was based on an ARM Cortex-A8 architecture. Besides the fact that the source code is not available and therefore cannot be extended, the tool lacks several features (1, 3, 4, 6, 7, 8 and 9 are not supported), such as energy consumption estimation and support for multiple architectures.
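The two-phase idea described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Thach et al.: the `BlockStats` record, the per-event penalty values and the way penalties are summed are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class BlockStats:
    """Per-basic-block data (hypothetical layout)."""
    static_cycles: int          # static phase: cost assuming cache hits and correct predictions
    executions: int             # how many times the block ran
    cache_misses: int           # detected at runtime (dynamic phase)
    branch_mispredictions: int  # detected at runtime (dynamic phase)


def estimate_cycles(blocks: Iterable[BlockStats],
                    miss_penalty: int,
                    mispredict_penalty: int) -> int:
    """Refine the static estimate with penalty cycles for runtime events."""
    total = 0
    for b in blocks:
        total += b.static_cycles * b.executions               # static phase
        total += b.cache_misses * miss_penalty                # dynamic refinement
        total += b.branch_mispredictions * mispredict_penalty
    return total


# Illustrative numbers only; the penalties are assumed, not measured.
blocks = [BlockStats(5, 100, 3, 2), BlockStats(8, 10, 1, 0)]
total_cycles = estimate_cycles(blocks, miss_penalty=20, mispredict_penalty=13)
```

With these assumed values, the static phase contributes 580 cycles and the dynamic phase adds 106 penalty cycles.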
MARSSx86
Patel et al. proposed another QEMU modification to perform full system simulation for multicore x86-based CPUs, called MARSSx86 (Patel et al., 2013). MARSSx86 is a cycle-accurate tool that performs detailed simulation of CPUs, caches and memory (Figure 1), and it supports features 1, 2, 4 and 5. Although it is an open source tool, it currently does not support architectures other than x86. This limits its field of application, considering that ARM is the main architecture used in mobile devices (90% of the smartphone market share; Trefis, 2013). In the same way and with the same limitations, PTLsim (Yourst, 2007), a cycle-accurate x86 microprocessor simulator, was improved and ported to integrate with QEMU. Among several other restrictions, neither tool provides power dissipation estimation. Therefore, features 3, 6, 7, 8, and 9 are missing.
PowerTutor
PowerTutor (Gordon et al., 2011) is a power estimation system that uses the model generated by PowerBooter (Zhang et al., 2010) for online power estimation (feature 6). PowerBooter models the most significant components regarding power dissipation in the system, which are: CPU, LCD display, GPS, Wi-Fi, 3G, and audio interfaces. The power dissipation of each of these components is considered independently, which results in an error of 6.27%, according to the authors. PowerBooter relies on knowledge of the battery discharge voltage curve and access to the battery voltage sensor, which is available in most smartphones.
Even though this approach provides an accurate estimation of the system as a whole, it may fail when it comes to estimate power of a standalone application: there are always several applications running concurrently (e.g.: background services); and PowerTutor itself runs concurrently with all other applications, which will also influence the results. Because it assumes that a given application is running alone in the system, when, in fact, several other applications are running concurrently, feature 2 is partially supported.
In addition, even though PowerTutor was implemented for the Android platform, the instruction profiles need to be created specifically for each smartphone model. To the best of our knowledge, only a few smartphone models are supported: HTC G1, HTC G2 and Nexus One; all other devices will use a generic model that may not estimate the power and energy consumption with the proper accuracy (partial support for feature 9). As the PowerTutor is an Android application, it also partially supports feature 7 for having a graphical interface. Features 1, 3, 4, 5 and 8 are not supported.
Sesame
Sesame (Dong and Zhong, 2011) generates energy models relying on system statistics, such as CPU timing and memory usage provided by Linux. It also uses the Advanced Configuration and Power Interface (ACPI), available on modern mobile devices, which provides platform-independent interfaces for power states of the hardware, including processor and peripherals. Sesame monitors the accuracy of the energy model in use and adapts it accordingly when the accuracy is below a certain threshold (feature 6). This tool was developed to run on any Linux-based mobile system, which may include laptops and smartphones (features 8 and 9).
To avoid incurring additional processing overhead on the system, Sesame schedules its computationally intensive tasks to be executed when the system is idle and connected to a power supply. Hence, only data collection and simple calculations are performed during real system usage. The main issue with Sesame is that it is not able to provide information for each application: the data collected is always from the system as a whole. Therefore, the developer cannot analyse the actual costs of the application under development, since several other applications may be running concurrently in the system. Hence, features 1, 2, 3, 4, 5 and 7 are missing.
Trepn profiler
Trepn (Qualcomm, 2013) profiles performance and power dissipation for Qualcomm Snapdragon processors (features 5 and 6). External scripts may be used to allow automated test environments. The collected data can be seen in real time and it can be exported for offline analysis (feature 7). The power information is obtained through specific hardware sensors that are present in these processors. Therefore, the tool is limited to this family of processors (partial support for feature 9) and, like the previously discussed tool, cannot profile a particular application (so features 1, 2, 3, 4 and 8 are not supported).
Other Android specific tools
Shye et al. (2009) proposed an experiment in which Android G1 mobile phones with a logger were given to users in order to trace the users' activity and characterise the power consumption. They presented a regression-based model built from the data collected by their logger. As a case study, they slowly reduced the screen brightness and CPU frequency when the screen was active for a long time. This reduction in frequency and brightness resulted in total system energy savings of 10.6% with minimal impact on the user experience, according to the authors. Pathak et al. (2011) present a system call-based power modelling approach for smartphones, in which PowerMonitor (Moonson Solutions, 2014) was used to measure the energy consumption of the smartphones.
More recently, the following power profilers were proposed: VPA (Tu et al., 2014), EnTrack (Lee et al., 2015), and a model-based energy profiler (Kamiyama et al., 2014). In the first work, QEMU is modified to perform system-wide power and performance profiling. The second proposes an energy profiler that incorporates the energy consumed by system services. Finally, the third proposes a power model for Qualcomm's MSM8960 chipset. None of these works is able to provide application-specific data, as they perform system-wide profiling, in which several background processes are being executed. In addition, the last two cannot be easily extended to other architectures and devices.
AndroidPerf (Xue et al., 2015) is a cross-layer profiling system, which also modifies the QEMU emulator. Even though this profiler is able to trace both Java and native code, it does not provide any power or energy statistics. FEPMA (Kim et al., 2014) was developed by instrumenting the Android operating system to provide information about power and performance state changes. Based on this state change information, power models are used to estimate the energy consumption of a specific device. Hence, it is not portable to several architectures and devices, and it does not profile performance.
It can be noticed from previous works that none is able to provide the whole set of important features that we have previously defined. Table 1 presents a comparison between each presented tool and these features. 
Proposed profiler
The proposed tool was developed from scratch and it was built on top of Android's QEMU (the official and widely used tool to develop and test Android applications), which is a QEMU version modified to run the Android emulator. The initial implementation, called AndroProf, is discussed in Sartor et al. (2013). In addition, the emulator uses an Android Virtual Device (AVD) to determine the configuration of the device that will be emulated. An AVD is an emulator configuration that defines the software and hardware configuration, so that an actual device can be modelled and emulated (Android, 2013b). This configuration can be modified prior to execution. For instance, it is possible to emulate a device running Android 4.2.2 on an ARM processor, with 512MB RAM, 1GB internal storage and a 2GB SD card; or a device running Android 4.0.3 on a MIPS processor, with 256MB RAM, 512MB internal storage and a 4GB SD card. The proposed profiler is available in Sartor (n.d.).
Figure 2 shows an overview of the tool's flow. We modified QEMU to trace information about the applications and we developed additional tools with graphical interfaces to process the collected data and to characterise instructions, which will be saved in a database (feature 7). In contrast to the existing tools, besides being capable of presenting information for each process (feature 2), the proposed tool also allows one to see the hot spots of the code (feature 3). The hot spots are separated by basic blocks, from which it is possible to identify the most executed methods. In addition, the proposed tool provides native and Java code profiling (feature 1) and allows the creation of categories to separate the instructions according to their average costs in power and in cycles. With this information, it is possible to estimate the total power consumption and performance (in number of cycles) of a particular application (features 5 and 6). Moreover, it is able to process a large amount of data (i.e., collect and analyse data of an execution that can take days) (feature 4), store the processed data in a database for future reuse, and support all Android emulator architectures (i.e., ARM, MIPS and x86) (feature 8) and multiple devices (feature 9). It also has a selective trace mechanism, which allows the user to enable or disable the trace at any time.
More details about what was implemented will be discussed in the next subsections.
Instruction profiling
All modifications to the Android QEMU were developed either by creating solutions that were compatible with all the supported architectures or, when it was not possible to implement a generic solution, by creating specific ones for each architecture. QEMU is an instruction-level simulator that emulates a virtual hardware platform using different ISAs: ARM, MIPS and x86. This emulation is possible due to a dynamic binary translation mechanism, which translates instructions from the guest to the host machine, typically one basic block (BB) at a time. However, as it is an instruction-level simulator, it does not model the pipeline or memory accesses. As expected, it is slower than execution on real hardware, but neither as slow nor as accurate as cycle-accurate simulators (Weaver and McKee, 2008). Since the host machine is based on the x86 ISA, emulated x86 code can be executed directly on the host processor, owing to binary compatibility. ARM and MIPS code, on the other hand, needs to go through the binary translation process, as presented in Figure 3. However, the needed profiling data is extracted precisely during the translation process. Therefore, the binary translation mechanism must be active even when x86 code is executed. Because data is obtained at this level (at the emulated hardware level, below the operating system), the tool can gather information from both native and Java code.
To estimate performance and energy consumption, the tool first collects all the instructions that are executed by each of the applications and how many times each of these instructions was executed. This is done by executing QEMU with specific options, such as the one that enables the logging of all basic blocks as they are translated. This logging mechanism was already available in the emulator and it was modified to attach a basic block identifier to each new basic block that comes in for translation. In each translation, the basic block from the emulated architecture is saved into a log file ("BB log file" from Figure 2). Therefore, all basic blocks and, consequently, all instructions, can be profiled.
However, it is also necessary to identify which is the application that is running the current code, since there is the need to profile each application individually. Thus, to distinguish processes, we extended and used the context switching trace mechanism, implemented by the Open Handset Alliance (OHA), to get the Process Identifier (PID) of the current process for the ARM and MIPS ISAs, which is obtained by reading the device's emulated memory. For the x86 ISA, this same mechanism is not available; therefore, we had to modify the Linux kernel to save the PID of the current process to a debug register whenever a context switch happens.
A hash table indexed by the program counter (PC) of the basic block and the PID of the process that is executing this BB is responsible for saving how many times each basic block was executed by each process. For that, we modified the QEMU emulation flow, as shown in Figure 4, which is explained next. Every time there is a basic block to be executed, QEMU first verifies whether it is cached (i.e., whether it has already been translated before). If it is not found in the cache, QEMU translates it into a Translated Block (TB), which is a basic block composed of instructions from the ISA of the host machine (x86 code). On the other hand, if the current BB is cached, its corresponding TB is loaded from the cache, so there is no need to translate the BB again. This verification is done for each BB as the application executes. The process of reading a TB from the translation cache is slow. Therefore, QEMU implements a TB chaining mechanism, depicted in Figure 5. QEMU tries to chain the current TB to the next TB that will be executed so that it does not need to search for and fetch the TBs one by one. However, this chaining mechanism had to be removed, since every executed BB must be tracked in the hash table.
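The counting structure described above can be sketched in a few lines. This is a minimal illustration, not the code added to QEMU: the class and method names are hypothetical, and a Python dictionary stands in for the hash table keyed by the (PC, PID) pair.

```python
from collections import defaultdict
from typing import Dict, Tuple


class BasicBlockCounter:
    """Counts basic block executions per process (hypothetical names)."""

    def __init__(self) -> None:
        # Key: (PC of the basic block, PID of the running process).
        self.counts: Dict[Tuple[int, int], int] = defaultdict(int)

    def record_execution(self, pc: int, pid: int) -> None:
        # Called once each time a basic block is about to be executed.
        self.counts[(pc, pid)] += 1

    def counts_for_pid(self, pid: int) -> Dict[int, int]:
        # Per-process view: PC -> number of executions.
        return {pc: n for (pc, p), n in self.counts.items() if p == pid}


counter = BasicBlockCounter()
for pc, pid in [(0x8000, 42), (0x8000, 42), (0x8010, 42), (0x8000, 77)]:
    counter.record_execution(pc, pid)
```

Because the PID is part of the key, the same basic block executed by two processes (for example, code in a shared library) is counted separately for each of them.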
We have also implemented a software cache for the hash table in order to speed up the simulation, which made it 7 times faster than the version without the cache. The cache is direct mapped with 2,048 slots (so it can fit in the hardware cache), with an average hit ratio of 97% on the tested benchmarks. In addition, a selective trace mechanism was implemented. By pressing a shortcut on the emulator's GUI, the tracing is enabled or disabled. The selective trace affects the QEMU emulation flow: if it is disabled, the hash table (and its cache) will not be updated. Therefore, while no application of interest is running, the tracing can be disabled in order to reduce the overhead of the simulation. On the other hand, if the tracing is enabled, all running applications are traced and they are separated by PID, as previously discussed. Finally, when the emulator's GUI is closed, the whole hash table (PC, PID and counter for all entries) is automatically saved to a file ("BB counter file", from Figure 2) for further processing.
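A direct-mapped software cache in front of the hash table, together with the selective trace switch, can be sketched as follows. Only the slot count of 2,048 comes from the text; the index function, the class and its names are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, int]  # (PC, PID)


class CachedBlockCounter:
    """Hash table of counters fronted by a direct-mapped cache (hypothetical)."""

    def __init__(self, num_slots: int = 2048) -> None:
        self.table: Dict[Key, List[int]] = {}   # key -> one-element counter cell
        self.slots: List[Optional[Tuple[Key, List[int]]]] = [None] * num_slots
        self.enabled = True                      # selective trace switch
        self.hits = 0
        self.misses = 0

    def _slot_index(self, pc: int, pid: int) -> int:
        # Assumed index function; the original one is not given in the text.
        return (pc ^ pid) % len(self.slots)

    def record(self, pc: int, pid: int) -> None:
        if not self.enabled:
            return  # tracing disabled: neither the table nor the cache is updated
        key = (pc, pid)
        idx = self._slot_index(pc, pid)
        slot = self.slots[idx]
        if slot is not None and slot[0] == key:
            self.hits += 1
            cell = slot[1]                       # hit: skip the hash table lookup
        else:
            self.misses += 1
            cell = self.table.setdefault(key, [0])
            self.slots[idx] = (key, cell)        # direct mapped: replace the occupant
        cell[0] += 1

    def count(self, pc: int, pid: int) -> int:
        cell = self.table.get((pc, pid))
        return cell[0] if cell is not None else 0
```

A hit increments the counter through the cached reference, so the hash table is consulted only on a miss; a conflicting block simply evicts the previous occupant of the slot, and its count remains safe in the hash table.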
Post-processing tools

Instruction categorisation GUI tool
With all the executed instructions saved in two files by the emulator (Figure 2), this part of the tool is responsible for classifying them according to their category (e.g., arithmetic and logical, load/store, etc.). It also creates a file that contains information on the instructions ("Instr. Info File" from Figure 2), in which each of the available categories is assigned an average number of cycles and an average power dissipation. These categories can be customised in order to meet the developer's requirements. For instance, BEQ, BNE and BLT instructions may be inserted in the Conditional Branch category and so on. This GUI is presented in Figure 6. The left table contains the instructions that do not have any category. These instructions, if any, will be classified as "Undefined" and they will have the default costs, which can also be customised. The right table contains the instructions of a given category, selected by the "Selected Category" combo box. It is also possible to create instruction information files for different ISAs (e.g., ARMv7 or MIPS) and to define different instruction types within the same ISA (e.g., Thumb or regular ARM instructions for the ARM ISA). Thumb instructions are a subset of the ARM ISA with a reduced encoding size. Therefore, Thumb instructions need less memory than ARM instructions: they are 16 bits long, while ARM instructions are 32 bits long. As an example, let us consider the ARMv7 architecture. It has a conditional branch category with a CPU cycle cost of 3 (ARM, 2013) and a power dissipation of 113 milliwatts (Bazzaz et al., 2013) for each instruction. This category comprises the following instructions: BEQ (ARM), BEQ (Thumb), BNE (ARM), BLT (Thumb), BLE (ARM), BGT (Thumb), BGE (Thumb) and other conditional branch instructions.
In order to make the process of creating a new categorisation file easier and faster, all instructions of a certain type can be selected and assigned to a new category in a single step. It is also possible to import existing category characterisation files to reuse or edit. The output is saved as an XML file.
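The categorisation scheme can be pictured as a lookup from (mnemonic, instruction type) to a category carrying average costs. The sketch below is hypothetical: the conditional branch costs are the ARMv7 values quoted above, the "Undefined" default costs are placeholders, and the XML layout of the tool's output file is not reproduced because it is not described here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Category:
    name: str
    cycles: float    # average CPU cycles per instruction
    power_mw: float  # average power dissipation in milliwatts


# Default costs for uncategorised instructions (placeholder values).
UNDEFINED = Category("Undefined", cycles=1, power_mw=100.0)

# ARMv7 conditional branch costs as quoted in the text.
COND_BRANCH = Category("Conditional Branch", cycles=3, power_mw=113.0)

# Key: (mnemonic, instruction type within the ISA).
CATEGORY_OF: Dict[Tuple[str, str], Category] = {
    ("BEQ", "ARM"): COND_BRANCH,
    ("BEQ", "Thumb"): COND_BRANCH,
    ("BNE", "ARM"): COND_BRANCH,
    ("BLT", "Thumb"): COND_BRANCH,
}


def categorise(mnemonic: str, instr_type: str) -> Category:
    # Instructions without a category fall back to "Undefined".
    return CATEGORY_OF.get((mnemonic, instr_type), UNDEFINED)
```

Keeping the instruction type in the key allows the same mnemonic to carry different costs in its ARM and Thumb encodings if the developer chooses to separate them.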
Analysis GUI tool
After tracing the application and categorising the instructions, all these files are imported into another tool developed for the proposed profiler. This analysis tool ("Analysis GUI" in Figure 2), whose GUI is presented in Figure 7, imports both files created by QEMU (the BB log and BB counter files in Figure 2) and the instruction categorisation file ("Instr. Info File" in Figure 2). After processing the data, it presents the results and saves them into a database. This database is necessary because of memory limitations, and it also provides a way of loading previously saved architecture configurations. Some features of this GUI are:
• information about basic blocks, instructions and categories: total cycle and power dissipation costs and histograms (e.g., Figure 8)
• PID chart based on the total cycle or power dissipation cost of each PID, which shows which processes are the most costly
• performance and energy consumption estimation based on a given operation frequency and other reference data
• import profiles (instruction characterisation) for different instruction set architectures, or variations (i.e.: different organisations) of these ISAs.
It is also possible to run a set of applications automatically with the scripts that are provided. Therefore, a set of benchmarks can be executed in sequence on the emulator and their data will be automatically saved. Since the designer can easily know the PID of each application, it is possible to evaluate each process separately. 
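As an illustration of the per-PID view listed among the features above, the sketch below aggregates basic-block cycle costs by process. The record layout `(pid, bb_id, times_executed)` and the function name are assumptions made for this example; they do not describe the actual format of the BB log and BB counter files.

```python
from collections import defaultdict


def cycles_per_pid(bb_executions, bb_cycles):
    """Aggregate cycle costs per process.

    bb_executions: iterable of (pid, bb_id, times_executed) records
                   (hypothetical layout).
    bb_cycles:     dict mapping bb_id to the cycle cost of one execution.
    Returns (pid, total_cycles) pairs, most costly process first.
    """
    totals = defaultdict(float)
    for pid, bb_id, count in bb_executions:
        totals[pid] += bb_cycles[bb_id] * count
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Since each benchmark runs under a known PID, filtering this result by PID gives the cost of a single application.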
Benchmarks
To the best of our knowledge, there is no benchmark set written in Java with corresponding versions that have their hot spots (the most time-consuming methods) implemented in native code and called from the Java side through the JNI. Therefore, we developed these applications based on a subset of the JVM SPEC 2008 (SPEC, 2013), which was originally developed for measuring the performance of the Java Runtime Environment. This resulted in a benchmark set composed of 12 Java-only applications. We then profiled these 12 applications to identify the hot spots of the code. The methods that accounted for 10% or more of the application's total execution time were converted to native code, which is called through the JNI. For some benchmarks, a single method is responsible for more than 90% of this total, while for others up to five methods were necessary to reach this share. Of these 12 applications, 7 fit the aforementioned constraint (at least one method with 10% or more of the execution time), and the corresponding methods were converted to native code. Therefore, we have two benchmark sets:
• The Java Benchmark Set: 12 benchmarks written in Java for Dalvik, modified from the original JVM SPEC 2008 set.
• The Java and JNI Benchmark Set: 7 benchmarks taken from the aforementioned set, plus their counterparts in which the most representative methods were implemented with JNI.
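The hot-spot selection used to build the second set can be sketched as a simple threshold filter over per-method execution times. This is an illustrative sketch only: the function name and the input format are hypothetical, and it takes the "10% or more" reading of the threshold.

```python
def select_hot_spots(method_times, threshold=0.10):
    """Return the methods whose share of total execution time is at least
    `threshold`, ordered from the most to the least time-consuming.

    method_times: dict mapping a method name to its execution time
                  (any consistent unit).
    """
    total = sum(method_times.values())
    if total <= 0:
        return []
    hot = [m for m, t in method_times.items() if t / total >= threshold]
    return sorted(hot, key=lambda m: method_times[m], reverse=True)
```

An application with an empty result has no method worth converting, which corresponds to the 5 benchmarks that were left out of the Java and JNI set.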
Simulation results
To evaluate the emulation speed of the proposed tool, we compared the simulation time of our implementation with that of the original Android emulator. Both were executed with the same user options. The Java benchmark set (Section 4) was used for this comparison, and the host computer used for the simulations had the following configuration: Intel Core i7 860 2.80 GHz, 8 GB RAM, Samsung HD103SI HDD. The AVD used was: Android 4.0.3, 512 MB RAM, on ARM, x86 and MIPS CPUs. Figure 9 presents the average simulation slowdown (i.e., the ratio between the simulation time of the proposed tool and the simulation time of the unmodified emulator) over three executions of each benchmark using the ARM, x86 and MIPS CPUs. It is important to note that the reference (the unmodified emulator) corresponds to each one of the architectures, i.e., for the ARM architecture, the simulation times of the ARM emulator with and without the modifications were compared. The average slowdown was 2.05 and 2.0 for ARM and x86, respectively. The slowdown varies between benchmarks because of their different control flow characteristics: the more control instructions a benchmark has, the more basic blocks it presents, and the more the hash table is stressed. Moreover, to keep track of all executed basic blocks, the chaining mechanism was disabled, as discussed in Section 3. Disabling this mechanism creates an overhead of 50%, while our hash table creates an additional overhead of 33%. The original MIPS emulator is slower than its ARM and x86 counterparts because it saves data into the QEMU BB log that the others do not. The amount of data that the original MIPS emulator saves is larger than the amount profiled by our tool and, on top of that, the MIPS emulator writes all this information to a file during the execution.
On the other hand, our tool stores the additional collected data in memory and saves it to disk in chunks. For these reasons, our modified tool is faster than the original for most of the benchmarks (the ones with values less than 1 in Figure 9). For the tested applications, the average speedup was 1.83 times for this architecture.
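The slowdown metric used in this section is a plain ratio of simulation times. The sketch below assumes that the three runs of each configuration are averaged before the ratio is taken, which the text does not state explicitly; the function name is hypothetical.

```python
from statistics import mean


def slowdown(modified_runs, original_runs):
    """Ratio between the mean simulation time of the modified emulator and
    that of the unmodified emulator for the same architecture.
    Values above 1 are slowdowns; values below 1 are speedups."""
    return mean(modified_runs) / mean(original_runs)
```

If the two reported overheads compound multiplicatively (an assumption, since the text does not say how they combine), 1.5 × 1.33 ≈ 2.0, which is consistent with the average slowdowns reported for ARM and x86.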
Case study
To validate our tool, we have chosen the Java and JNI benchmark set presented in Section 4. In this section, we compare the number of executed instructions across the ARM, MIPS and x86 architectures; evaluate the impact of the Dalvik virtual machine; estimate performance and energy consumption for different processor architectures and organisations; and assess a scenario in which there is a fixed energy budget and the user wants to execute the maximum number of applications possible.
Dalvik virtual machine and JNI
The Dalvik Virtual Machine (VM) is a register-based architecture (Android, 2013a), as opposed to the Java Virtual Machine (JVM), which is stack-based. Dalvik was designed to run in low-memory environments and to allow multiple instances of the VM, so every application runs on its own instance, which provides security, isolation and effective memory management. A register-based architecture needs fewer VM instructions to implement high-level code, even though this comes at the cost of increased instruction size, code size and memory fetches (Ehringer, 2010). With larger instructions, register-based architectures take more time to execute each instruction than stack-based architectures. However, the product of the time per instruction and the number of executed instructions is usually smaller in register-based architectures than in stack-based ones, which means that the former take less time to execute an application (Ehringer, 2010). Through JNI, Java code can interact with C or C++ by calling methods implemented in native code, so one may reuse legacy code and increase the application's performance in some situations. On the other hand, JNI compromises the application's portability and security, since the code needs to be compiled for each target architecture and no longer runs on the Dalvik VM. In addition, using JNI creates an overhead because of the context switches, which involve copying operands in memory from one side to the other (from the Java to the native side or vice versa). In this work, we refer to applications implemented in Java and native code (through JNI) as JNI applications, in order to distinguish the three main types of Android applications: pure Java (Java applications), Java with native code (JNI applications) and those written purely in native code (native applications).
Figure 10 presents the average number of executed instructions and the standard deviation for each benchmark considering the ARM, x86 and MIPS architectures. Each benchmark was executed only three times, given the small standard deviation (less than 1%). As can be observed, the MIPS executed more instructions than the ARM because its ISA comprises simpler instructions (more RISC oriented), while the x86, in general, executed fewer instructions because of its CISC nature. Figure 11 presents the Java/JNI ratio, which is the ratio between the number of instructions executed by the Java-only application and by the application with JNI calls (comprising JNI and Java code, and the communication between them). As one can observe, the decision to develop some parts of the application using JNI depends on the type of application. For example, the Scimark Sparse and FFT benchmarks have one method that accounts for 80-90% of the total execution time, and this method performs calculations with arrays and matrices. Such applications are very suitable for JNI use: the designer must reprogram only one method, and there are almost no context switches between the Dalvik and native code. On the other hand, MPEG Audio is quite the opposite: it has five methods that take only about 10% of the execution time each. When methods are simple or called several times, there are cases with slowdowns, because the faster execution of native code does not amortise the costs of context switching. Therefore, what the proposed tool shows the developer is whether the usage of JNI is beneficial for a specific application. This may depend on several factors, such as the overhead of accessing Java attributes and calling a Java method from a native method, the number of times that this method is called, the complexity of the method, etc.
In this case, the tool can easily be used to identify the hot spots that may bring performance advantages when implemented in JNI, and to verify the real improvements after the hot spots have been implemented natively. This discussion highlights the importance of supporting the profiling of native methods. Moreover, this analysis, which is done with a single, centralised profiling tool, can be carried out at early design stages and can consider any available processor architecture (e.g., ARM, MIPS and x86) and organisation (e.g., ARM Cortex-A8, Cortex-A9, Cortex-A15, etc.). Let us remark that, even though an instruction-level simulator may not be as accurate as cycle-level simulators or a real device with hardware counters, it is the most appropriate solution for comparing several architectures: no real hardware or measurement equipment is needed, and the comparison can be done at very early design stages, before releasing the application.
Java vs JNI applications
The impact of the Dalvik virtual machine
In order to evaluate the impact of the Dalvik VM for the chosen architectures, let us analyse Figure 11 again: the lowest Java/JNI ratio across all architectures for a given application represents the lowest impact of the virtual machine on this application's execution. This will be further explained with two examples. First, let us consider Scimark Sparse: it has a ratio of 8.72, 10.80 and 13.57 for ARM, x86 and MIPS, respectively. In this case, the Java version executed significantly more instructions than the JNI version regardless of the architecture. Comparing these three virtual machines, the one that presented the lowest overhead was the ARM Dalvik because, proportionally, it executed fewer instructions than the other two virtual machines. Now, let us consider a benchmark that executed fewer instructions in its Java version than in its JNI version: MPEG Audio, with a ratio of 0.66, 0.47 and 0.73 for the same architectures as before (ARM, x86 and MIPS). In this case, the VM that presented the lowest overhead was the x86 Dalvik. This shows that the x86 Dalvik was able to optimise the code's execution better than the two other VMs. Furthermore, executing the JNI version of this application is more expensive than executing its counterpart implemented in Java only. Hence, the use of JNI is not advisable in this specific case. Let us now analyse the Compress application. This is the most significant case regarding the overhead of the Dalvik VM: the MIPS and ARM Dalviks present 135.69% and 125.33% higher overhead than the x86 version, respectively.
In most cases, the Dalvik implemented for the x86 is better than its counterparts, with only two exceptions: Sparse and LU, on which the ARM version performs better. The MIPS Dalvik has the highest overhead, so it is not as optimised as the ARM and x86 Dalviks. Considering the average of all applications, the ARM and MIPS Dalvik versions have 39.54% and 65.18% more impact than the x86 Dalvik, while the MIPS Dalvik has 24.97% higher overhead than the ARM Dalvik.
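The reading of Figure 11 used in this subsection can be expressed in a few lines. The sketch below uses the two sets of ratios quoted in the text (Scimark Sparse and MPEG Audio); the function names are hypothetical and chosen for illustration.

```python
# Java/JNI instruction-count ratios quoted in the text.
RATIOS = {
    "Scimark Sparse": {"ARM": 8.72, "x86": 10.80, "MIPS": 13.57},
    "MPEG Audio": {"ARM": 0.66, "x86": 0.47, "MIPS": 0.73},
}


def java_jni_ratio(java_instructions, jni_instructions):
    """Instructions executed by the Java-only version divided by those
    executed by the version with JNI calls."""
    return java_instructions / jni_instructions


def lowest_vm_impact(ratios_by_arch):
    """Architecture with the lowest ratio, i.e. the one whose Dalvik has
    the lowest impact on this application."""
    return min(ratios_by_arch, key=ratios_by_arch.get)


def jni_reduces_instructions(ratio):
    """A ratio above 1 means the JNI version executes fewer instructions
    than the Java-only version."""
    return ratio > 1.0
```

With these values, `lowest_vm_impact` selects ARM for Scimark Sparse and x86 for MPEG Audio, and `jni_reduces_instructions` is false for every MPEG Audio ratio, matching the observation that JNI is not advisable for that benchmark.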
Estimating performance and energy consumption
In this subsection, we have chosen the ARM and x86 architectures to evaluate energy consumption and performance, considering different processor organisations to reflect both embedded and general-purpose systems. On the embedded side, we consider the ARM Cortex-A8, the Cortex-A9 and the Intel Atom, while the Intel Sandy Bridge Core i7 represents a general-purpose system. We obtained data on processor power dissipation, frequency and cycles per instruction (CPI), presented in Table 2, from Blem et al. (2013a, 2013b). As these sources do not provide power dissipation per specific instruction or category, we did not use this feature, even though it is available in the proposed tool. The MIPS architecture is not considered in this section because no up-to-date information on it, such as cycles per instruction and power dissipation, is available.
Main memory read/write power dissipation was obtained from CACTI 6.5, with the following configuration: 512 MB, 8 banks, a block size of 64 bytes and 45 nm technology. This results in an access time of 8.26 ns, an energy consumption of 2.66 nJ for a read and 2.56 nJ for a write, and a leakage power of 109.613 mW. Given the small difference between reads and writes, we considered the average of these two values (2.61 nJ) as the energy consumption of each memory access, regardless of whether it was a read or a write. Figure 12 and Figure 13 present the performance and the energy consumption of the Cortex-A8, Cortex-A9, Intel Atom and Intel i7 based on the aforementioned data. These figures show the estimation for each Java or JNI benchmark and the geometric mean over all applications. Comparing these results, the A8/A9 ratio is 2.58 for the execution time and 1.88 for the energy consumption. This means that, for the selected applications, the A9 is more than twice as fast as the A8 and consumes almost half the energy. The Atom/i7 ratio is 2.05 for the execution time and 0.48 for the energy consumption. Therefore, even though the i7 doubles the performance, it consumes more energy than the Atom. It is important to note that both Figure 12 and Figure 13 are in logarithmic scale and that the impact on energy and performance of choosing between a Java and a JNI implementation can exceed the cost of fully executing several applications. For instance, Scimark Sparse can be executed using only the energy saved by choosing the best version (Java or JNI) of a given application, e.g., the Java version of MPEG Audio instead of its counterpart implemented with JNI, or the opposite for Scimark LU.
These figures also show how performance and energy consumption can vary depending on the organisation of the same processor architecture (i.e., processors that implement the same ISA). Moreover, one can compare processors that support different ISAs: ARM and x86. As can be observed, the A9 presented a lower performance than the Atom; however, the former is much more energy efficient. Among these four processors, the one that, on average, consumes the least energy to execute the selected applications is the A9. Not surprisingly, it is one of the most used processors in the embedded market. This also highlights the differences between the embedded processors (Cortex-A8, Cortex-A9 and Intel Atom) and the general-purpose one (Intel i7) in terms of performance and energy consumption. The Intel i7 consumed the most energy to execute all the tested applications, and it was also the one with the best performance.
Considering the Cortex-A9 and the Intel Atom, one can note that, even though the x86 Dalvik presented a lower overhead and executed the applications faster than the ARM Dalvik, the Atom consumed more energy to execute the tested benchmarks. However, if one compares the Cortex-A8 with the Atom, the latter consumes less energy in most benchmarks, besides having a better performance with a lower VM overhead. This highlights the importance of the chosen ISA and, most importantly, shows that the microarchitecture also influences the efficiency of the virtual machine.
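One plausible way to combine the reference data of this subsection into time and energy estimates is sketched below. The exact formula used by the tool is not given in the text, so this model (CPU power times execution time, plus per-access and leakage energy for the main memory) is an assumption, as are the function and parameter names. Only the two memory constants come from the text; the processor values would come from Table 2.

```python
MEM_ACCESS_ENERGY_J = 2.61e-9  # average read/write energy per access (from the text)
MEM_LEAKAGE_W = 0.109613       # main memory leakage power (from the text)


def estimate(instructions, mem_accesses, cpi, freq_hz, cpu_power_w):
    """Return (execution time in seconds, energy in joules).

    Assumed model:
      time   = instructions * CPI / frequency
      energy = CPU power * time
               + memory accesses * energy per access
               + memory leakage power * time
    """
    time_s = instructions * cpi / freq_hz
    cpu_j = cpu_power_w * time_s
    mem_j = mem_accesses * MEM_ACCESS_ENERGY_J + MEM_LEAKAGE_W * time_s
    return time_s, cpu_j + mem_j
```

Under this model, a processor that is faster but dissipates more power can still consume more energy overall, which is the pattern reported for the Atom/i7 comparison.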
Energy budget
Users of embedded systems want to execute the maximum number of applications before the device runs out of battery. So, let us consider four devices that have identical hardware but different processors (Cortex-A8, Cortex-A9, Intel Atom and Intel i7), each with 3 kJ of energy available for the processor and memory. We then examine the set of applications that can be executed with this amount of energy in three situations: all the applications must be written in Java only; all the applications must use JNI; and, finally, the best version between Java and JNI (i.e., the one that consumes less energy) is chosen for each application. The results are presented in Table 3. Considering Java applications only, it can be observed that the Cortex-A9 and the Intel Atom are able to execute all seven applications within the energy budget. The i7, on the other hand, is able to execute only five applications. The Cortex-A8 can execute all but Scimark MonteCarlo. Now, considering the subset of JNI applications only, even though none of the processors is able to execute all the JNI applications, the Cortex-A9 is able to execute one more application than the others (Compress).
Choosing the best version between Java and JNI for each application allows the processors to execute more applications within the aforementioned energy budget and highlights the importance of investigating the efficiency of virtual machines. In this scenario, all the processors but the i7 are able to execute the seven applications, as can be observed in Table 3. The most significant case, when it comes to choosing the best version between Java and JNI, appears when one compares the Core i7 with the ARM processors. While the Core i7 is not capable of executing all applications, the Cortex-A9 is able to execute all of them while consuming less than half of the energy budget.
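The budget scenario can be sketched as follows. The per-application energy values would come from the estimates of the previous subsection; the function names are hypothetical, and taking the cheapest applications first is an assumption about how the set is chosen (it maximises the number of applications that fit, but the text does not state the selection rule).

```python
def best_versions(java_j, jni_j):
    """For each application, keep the energy (in joules) of whichever
    version, Java or JNI, consumes less."""
    return {app: min(java_j[app], jni_j[app]) for app in java_j}


def apps_within_budget(energy_j, budget_j=3000.0):
    """Pick applications in increasing order of energy until the next one
    would exceed the budget. Returns (chosen applications, energy used)."""
    chosen, used = [], 0.0
    for app, energy in sorted(energy_j.items(), key=lambda kv: kv[1]):
        if used + energy > budget_j:
            break
        chosen.append(app)
        used += energy
    return chosen, used
```

Running `apps_within_budget` on the Java-only energies, on the JNI-only energies and on `best_versions(...)` reproduces the three situations compared in Table 3; the third can never fit fewer applications than either of the first two.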
Conclusions and future work
In this work we presented a novel tool to generate relevant information (e.g., performance and energy consumption estimation per application) for easier data analysis at early design stages. As future work, we will create default profiles, with power and cycle costs for the main instruction categories, for multiple processor architectures and organisations. We will also study means to trace the applications with chaining enabled to speed up the simulation.
Through the proposed tool, it was also possible to analyse the impact of the Dalvik virtual machine on all officially supported Android CPU architectures. The experiments show that the overhead of virtual machines on different architectures varies considerably for most of the applications. As one could observe, the x86 Dalvik is the one that presents the lowest impact on the execution of an application, followed by the ARM Dalvik. The MIPS version is the one that performs the worst. Despite its lower VM overhead, the Intel Atom consumed more energy to execute the tested applications than the Cortex-A9, and less energy than the Cortex-A8. Most importantly, we demonstrated that the target processor (which includes architecture and organisation) and the choice of the most appropriate implementation of the application strongly influence performance and energy consumption. As future work for the experiments, we will consider two more types of applications: those that use Native Activities, which are Android activities written in native code; and Renderscript applications. We also intend to build a framework to automatically investigate which is the best way to develop a given application to meet performance and energy requirements, considering several target devices.
