Abstract: Virtualization is a key enabling technology for cloud computing. It allows applications to share computing, memory, storage, and network resources. However, physical resources are not standalone and the server infrastructure is not homogeneous. The CPU cores are commonly connected to shared memory, caches, and computational units. As a result, the performance of cloud applications can be greatly affected if, while executing on different computing cores, they compete for the same shared cache or network resource. The performance degradation can be as high as 50%. In this work we present a methodology that predicts the performance problems of cloud applications during their concurrent execution by looking at the hardware performance counters collected during their standalone execution. The proposed methodology fosters the design of novel solutions for efficient resource allocation and scheduling.
I. INTRODUCTION
Cloud computing is becoming increasingly important. Its blend of flexible allocation and virtualization empowers scalability and reliability of applications, minimizing the burden posed on customers to fulfill these fundamental requirements. Virtualization decouples operating systems (OSs) and applications from hardware, allowing easy migration (between different hardware and sites) and transparent upgrades.
The unleashed computing power enables new applications with virtually no bounds for scalability. At any time customers are assured that (almost) any demands for resources can be satisfied for an affordable price. So the core of a cloud is its management plane, which is responsible for reliability, scalability and efficiency.
Cloud computing efficiency involves two main issues: i) how much equipment is needed to keep the cloud running and meet the Quality of Service (QoS) constraints negotiated in Service Level Agreements (SLAs); ii) how to optimize operating expenses and minimize hardware power consumption. It is up to the management plane to decide whether to "shrink" or "expand" the cloud, and this decision process is complex. It must account for a large number of parameters such as time, current load, load dynamics, size of the resource pool, networking, data center topology, and QoS [1]. The control plane enables or disables data center equipment and migrates virtual machines (VMs) to different hardware as needed.
The virtualization layer ensures that VMs are logically isolated. However, even if VMs are perfectly isolated in the virtualized environment, they still share hardware resources, and these resources are not infinite: if one VM uses them, another VM has to wait. What is normally missing in cloud management planes is the consideration of mutual interference between VMs.
Most resource management algorithms consider CPU cores as unified resources that add performance in proportion to their number. Many consider a six-core CPU to be three times faster than a two-core CPU running at the same frequency, but this is an idealized situation. In reality the performance gain may vary due to the subsystems shared between cores. These subsystems can include CPU caches, memory bus, I/O lines, instruction decoders, branch predictors, computational units and other components. Therefore, under heavy load it is unlikely to see a linear gain in performance when adding more CPU cores. In fact, an increase in the number of cores can even degrade VM performance by up to 50% due to inter-VM interference [2], [3].
The level of interference varies and depends on many factors. As we discuss in this work, a proper placement of VMs can improve their performance significantly, while improper placement can cause significant performance degradation. The OS, libraries, programs and their versions, compilers used to build the system, hardware, and BIOS settings are some of the factors that affect the performance results. It is difficult to predict how a particular VM will co-exist with other VMs in a given scenario. The most precise way to understand the loss in performance caused by a VM is to measure it. The precision of this method comes at the expense of its complexity: each VM should be executed with each other VM at least once during the measurement phase. This is possible for systems with a limited number of VMs, but becomes impractical when the number of VMs is large.
Another factor contributing to complexity is VM diversity. There is a virtually infinite number of different virtual machines. But are they really so different in terms of interference? What if we find a simple way to rank arbitrary VMs according to the interference they cause to others? Such a ranking may be imperfect, but it will certainly be useful for cloud management.
In this paper we present a novel methodology to classify and rank VMs based on the analysis of Hardware Performance Counters (HPCs). HPCs accumulate resource access statistics such as the number of times a VM accessed CPU caches or the success rate of the branch predictor. The main contributions of this paper are:
• The development of a methodology for profiling and
II. METHODOLOGY
Fig. 1 presents a high-level overview of the classification process. It consists of two major phases: learning phase and working (classifying) phase. During the learning phase the system figures out which hardware counters are the most important for application profiling and how they are related to system performance. The obtained knowledge is stored in a database. The classification is performed by comparing the profile of a new task with classes from the database.
Each VM behaves differently in the presence of other VMs competing for computing, memory, or network resources: it can run unaffected, degrade, or even improve in performance. If it is known which VMs co-exist well and which do not, we can perform correct placement and schedule their execution properly. The key question remains: how to assess and predict VM interference? A straightforward way is to launch all the VMs together in pairs and measure their mutual interference. However, this would take too much time. A better way is to profile the VMs individually and then, based on their profiles, reason about how they will interact with each other.
Our goal is to classify VMs based on two parameters of interest, sensitivity and interference, and use them to guide resource allocation and scheduling. The sensitivity is a measure of how the performance of a given VM is affected by the activity of other VMs. Conversely, the interference describes how the behavior of a given VM affects the operation of neighboring VMs. As neither the sensitivity nor the interference can be measured directly, we derive their values from the analysis of HPCs.
A. Hardware Performance Counters
HPCs are a mechanism for application profiling. HPCs are built-in CPU circuits designed to collect runtime low-level execution statistics. HPCs consist of two parts: event detectors and 64-bit registers (counters). Each time an event occurs the register associated with this kind of event is incremented.
HPC statistics include the frequency of access to instruction decoders, caches and Floating Point Unit (FPU).
B. Virtual Machines Profiling
The mapping between interference/sensitivity and the values of HPCs can be measured through correlation analysis. For this, we first calculate interference and sensitivity for a small set of VMs. This can be done by launching all pairs of the VMs and measuring their execution performance. Then we compute the linear correlation by calculating Pearson's product-moment correlation coefficient (Pearson's r) between the interference/sensitivity and each of the hardware counters. The HPCs with strong correlation are then selected to predict interference/sensitivity values of an arbitrary VM. This prediction can be done by, for example, regression analysis.
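The selection and prediction steps described above can be illustrated with a minimal sketch. It assumes one standalone HPC value per VM and per counter, uses a hypothetical correlation threshold of 0.8, and takes simple one-variable least-squares regression as the predictor; the function names and all sample numbers are made up for illustration and are not taken from our measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no linear information
    return sxy / math.sqrt(sxx * syy)

def select_counters(profiles, target, threshold=0.8):
    """Keep counters whose |r| against the target reaches the threshold.

    profiles: {counter_name: [value for each training VM]}
    target:   [interference (or sensitivity) for each training VM]
    The threshold is an assumed value, not one prescribed by the text.
    """
    selected = {}
    for name, values in profiles.items():
        r = pearson_r(values, target)
        if abs(r) >= threshold:
            selected[name] = r
    return selected

def fit_line(xs, ys):
    """Least-squares fit y = a * x + b for one selected counter."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line to a constant counter")
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

# Illustrative (made-up) training data for five VMs.
profiles = {
    "cache-misses": [1, 2, 3, 4, 5],
    "branch-misses": [3, 1, 4, 1, 5],
}
interference = [2, 4, 6, 8, 10]

selected = select_counters(profiles, interference)
slope, intercept = fit_line(profiles["cache-misses"], interference)
predicted = slope * 6 + intercept  # estimate for a new VM with counter value 6
```

With several strongly correlated counters, a multivariate regression would be the natural extension; the single-counter fit is kept here only for brevity.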
III. EXPERIMENTAL STUDY
Our experiments are executed on a small-scale heterogeneous testbed that covers different architectures, using a collection of different benchmark applications.
A. Testbed
We use the following equipment:
a) ARM Exynos: an "Odroid-U2" board based on the Samsung Exynos-4412 system-on-chip with a four-core ARM Cortex-A9 CPU clocked at 1.7 GHz. ARM Exynos has 2 GB of RAM and 8 GB of eMMC storage.
b) AMD FX: a board based on the eight-core AMD FX-8120 CPU. The CPU consists of four two-core blocks, each equipped with its own 2 MB L2 cache. In addition, all two-core blocks share the same 8 MB L3 cache. To obtain stable and repeatable results, dynamic overclocking is disabled in the BIOS. AMD FX is equipped with 16 GB of DDR3-1600 RAM. A 64 GB Crucial M4 Solid State Drive (SSD) is used for storage.
All measurements were taken with the "perf stat" command, using all relevant counters reported by the "perf list" command.
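As an illustration of how such measurements can be post-processed, the sketch below parses the machine-readable output of `perf stat -x, -e <events> -- <command>` (which perf writes to stderr by default). The field order assumed here (count, unit, event name, counter run time, percentage of time the counter was running) follows the CSV format described in the perf-stat manual page and should be checked against the installed perf version. The function name and the sample text are hypothetical, and the numbers are made up.

```python
def parse_perf_stat_csv(text, sep=","):
    """Parse `perf stat -x<sep>` output into {event: (count, running_pct)}.

    Assumed field order: count, unit, event, run time, running percentage.
    Events reported as "<not supported>" or "<not counted>" are skipped.
    """
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(sep)
        if len(fields) < 3:
            continue
        raw_count, event = fields[0], fields[2]
        try:
            count = float(raw_count)
        except ValueError:
            continue  # placeholder such as "<not supported>"
        running_pct = None
        if len(fields) > 4 and fields[4]:
            try:
                running_pct = float(fields[4])
            except ValueError:
                running_pct = None
        results[event] = (count, running_pct)
    return results

# Made-up sample in the assumed format, not output from our testbed.
SAMPLE = """\
1000000,,cycles,500000,100.00,,
800000,,instructions,500000,100.00,0.80,insn per cycle
<not supported>,,stalled-cycles-backend,0,100.00,,
1200,,cache-misses,250000,50.00,,
"""

parsed = parse_perf_stat_csv(SAMPLE)
# Counters that ran for less than 100% of the time were time-multiplexed.
multiplexed = [e for e, (_, pct) in parsed.items() if pct is not None and pct < 100.0]
```

Keeping the running percentage makes it easy to flag counters whose values were scaled because more events were requested than there are hardware registers.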
B. Benchmarks
Benchmarks are selected to provide a comprehensive comparison of cloud workloads. The emphasis is on real-world programs, although a few synthetic benchmarks (matrix, blosc and integer) are present as well.

We implemented a dedicated VM Manager, a subsystem that provides the virtualization method appropriate for each platform: QEMU for AMD FX and Linux Containers (LXC [7]) for ARM Exynos, as these platforms do not allow for standard VM management. The Resource Pool provides an abstraction layer between VMs and hardware resources. In our experiments each virtual machine was allocated 1 GB of RAM and one CPU core. The measurement subsystem serves two purposes. The first is to collect the HPC statistics. The second is to ensure that no activity in the system is left unaccounted for.

Tables I and II show how the VMs affect the performance of each other during their concurrent execution. Columns specify the names of the benchmarks currently being measured (foreground VMs), while rows are associated with the benchmarks executed at the same time on the neighboring core (background). The numeric values reported show the performance degradation of the foreground benchmarks with respect to their standalone execution. The dark grey cells correspond to a performance degradation of more than 15%, while the light grey cells show a degradation between 10% and 15%. Values in square boxes indicate a performance increase. The latter can be achieved when the concurrent benchmark execution makes the use of the shared hardware resources (e.g., caches) more efficient than during standalone runs. For AMD FX we report interference results for both sibling cores that share the local cache and distant cores that share fewer resources.
IV. PERFORMANCE RESULTS AND ANALYSIS
These synoptic tables give several interesting insights. The first one is clear: running VMs on different cores does not ensure performance isolation. In some cases the performance degradation is high enough to affect the perceived QoS. Another interesting observation is that in the AMD FX architecture the interference is largely independent of the distance between cores. Finally, some pairs show a performance improvement, which is counter-intuitive at first sight. First, the gain is usually small and in some cases may simply be measurement noise, even though the measurements are averaged over many runs. Second, shared resources such as caches are managed by very complex heuristics. It is therefore not so surprising that these heuristics work better in some scenarios and setups than in others, especially since sharing resources is far more common than running in isolation, so the heuristics have been studied and tuned for these cases.

Table III presents the values of VM interference (how much the background affects the foreground) and sensitivity (how much a foreground VM is affected by a concurrently running VM), calculated from the performance degradation values reported in Tables I and II. The sensitivity is the average of the values in each column, while the interference is the average over each row. According to [8], the performance degradation increases following a power law with the number of CPU cores: 5% interference leads to ∼18.5% and ∼33.6% of overhead for four and eight cores respectively, but we cannot draw strict conclusions on this yet.
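The row and column averaging behind Table III can be illustrated with a minimal sketch. It assumes the degradation table is stored as a nested mapping indexed first by the background benchmark (row) and then by the foreground benchmark (column); the function name and the sample values are hypothetical and do not reproduce Tables I and II.

```python
def rank_vms(degradation):
    """Derive sensitivity and interference from a degradation table.

    degradation[background][foreground] is the slowdown (in %) of the
    foreground benchmark; negative values denote a speedup.
    Returns (sensitivity, interference), both keyed by benchmark name.
    """
    interference = {}
    column_values = {}
    for background, row in degradation.items():
        if not row:
            continue  # nothing was measured against this background
        # Interference: average over the row of the background benchmark.
        interference[background] = sum(row.values()) / len(row)
        for foreground, value in row.items():
            column_values.setdefault(foreground, []).append(value)
    # Sensitivity: average over the column of the foreground benchmark.
    sensitivity = {fg: sum(vals) / len(vals) for fg, vals in column_values.items()}
    return sensitivity, interference

# Made-up degradation values (%) for three benchmarks.
table = {
    "matrix":  {"nginx": 20.0, "integer": 4.0},
    "integer": {"nginx": 2.0, "matrix": -1.0},
    "nginx":   {"matrix": 6.0, "integer": 0.0},
}

sensitivity, interference = rank_vms(table)
worst_neighbor = max(interference, key=interference.get)
most_sensitive = max(sensitivity, key=sensitivity.get)
```

Sorting either mapping by value gives the ranking of VMs by the interference they cause or by how strongly they are affected.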
Many more interesting measurements are omitted from this paper for lack of space. The complete data set collected during the experiments is available at [9].
A. Putting HPCs at work
Events can be different in nature, but all of them can be assigned a performance cost. For example, each last-level cache (LLC) miss costs around 30-60 cycles of additional CPU time [10]. However, to understand the impact of an event on system performance it is necessary to analyze the rate at which the event occurs in addition to its cost. Low-frequency events do not contribute much to the VM interference. Therefore, we exclude low-frequency events from the analysis, even if they are costly. We operate with normalized event frequencies to avoid bias from the CPU clock rates.

Table III gives a high-level view of the sensitivity and interference "properties" of the benchmarks. Unsurprisingly, a quick investigation indicates that all the top interfering benchmarks use the memory subsystem heavily. Sdagp operates over a large set of scattered data. This requires a lot of memory access requests that cannot be served efficiently. Matrix is optimized for efficient memory access, but uses all available cache and constantly displaces other cached data. Blosc was designed to compress scientific data on-the-fly at extreme rates. It is capable of fully occupying the memory bus, which heavily impacts all other applications requesting bus access.
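The event filtering described at the start of this subsection can be sketched as follows. The sketch assumes that "normalized" means events per CPU cycle and uses a hypothetical cut-off of 0.001 events per cycle; neither choice is prescribed by the text, and the counter values are made up.

```python
def normalize_events(counts, cycles_key="cycles"):
    """Convert raw event counts into rates per CPU cycle.

    Dividing by the cycle count (an assumed normalization) removes the
    dependence on clock rate and run length.
    """
    cycles = counts[cycles_key]
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return {event: count / cycles
            for event, count in counts.items() if event != cycles_key}

def frequent_events(rates, min_rate=1e-3):
    """Drop low-frequency events; min_rate is a hypothetical threshold."""
    return {event: rate for event, rate in rates.items() if rate >= min_rate}

# Made-up counts for one standalone run.
counts = {
    "cycles": 1_000_000,
    "instructions": 800_000,
    "cache-misses": 5_000,
    "page-faults": 10,
}

rates = normalize_events(counts)
kept = frequent_events(rates)  # page-faults falls below the cut-off
```

A cost-weighted variant (rate multiplied by an estimated per-event cycle cost) would be a possible refinement, but it requires per-platform cost figures that are not given here.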
The high demand for memory resources that makes a benchmark an interferer also makes it sensitive to contention for the same resources. Therefore, there is a clear correlation between the interference and sensitivity figures.
So much for purely empirical observations. We now proceed to identify which HPCs are the most representative of the interference/sensitivity properties. We compute the correlation between each performance counter and the values of interference and sensitivity, and focus further investigation on the counters with high correlation.

Fig. 3: Four cases of interference for ARM Exynos: no interference (only nginx is running), negative interference (nginx runs with integer), medium interference (nginx with wordpress) and strong interference (nginx with matrix).

Fig. 3 presents HPC measurements for different levels of task interference: no interference (nginx alone), low interference (nginx+integer), medium interference (nginx+wordpress), and high interference (nginx+matrix). Fig. 4 shows the measured HPCs for all the benchmarks on the Exynos, corresponding to the interference values of Table IIIa. As expected, for low interference there are no significant changes in HPC values. This means that benchmarks execute as if they were alone. However, when interference becomes significant the HPC values reflect task competition for the resources. Surprisingly, there is no increase in any memory-related counters, which indicates that the main memory is not a primary bottleneck: the bottleneck arises inside the CPU before the main memory is accessed. The CPU cannot dedicate enough internal resources to all active cores. The cores compete for the resources, and this race creates a lot of pipeline stalls. This is reflected by the "stalled-cycles-backend" parameter.
Referring again to Table II, we now interpret the results based on the HPC analysis. For sibling cores (Table IIa), there are many cases of performance improvement (represented with negative values of interference). This is especially evident for the ffmpeg benchmark. The reason for the performance improvements becomes evident from the analysis of two HPCs: TLB and L1 caches. These parameters indicate that some data (probably kernel code) is shared between VMs. Shared data may speed up simultaneous execution because, when one core accesses it, there is a chance that another core has already fetched it and stored it in the shared cache. Another possible reason is that the overhead of keeping cache lines coherent is lower for processes running on sibling cores [2]. The average per-task interference is around 3%.
For distant cores (Table IIb) the picture is quite similar to the case with sibling cores. There is no single largest contributing counter to the interference. The average per-task interference is 4%, slightly higher than for the sibling cores. This may be because sibling cores can share data more efficiently, or because of the cache coherency protocol.
Table V presents the efficiency of the hardware platforms in terms of Instructions Per (CPU) Cycle (IPC). The ARM Exynos has a substantially lower IPC. Interestingly, a higher IPC does not necessarily lead to a higher performance per MHz. This is due to the differences in hardware architectures and compiler optimizations.
The ARM platform shows performance issues for the ffmpeg and pgbench benchmarks. The ffmpeg benchmark is not optimized for this platform [11]. The results for pgbench may be limited by the weak storage subsystem.
The experimental results presented in this section unveil clear differences between the analyzed hardware platforms. In general, ARM cores have fewer optimization features than traditional x86 CPUs. They do less work per CPU cycle. For both platforms the most interfering tasks are those that use memory heavily (matrix, nginx, sdagp and blosc). This confirms that the memory-related subsystems are the biggest bottleneck of general-purpose CPUs [12]. This bottleneck also makes these tasks sensitive, because their performance depends almost entirely on data availability.
We can conclude that ARM Exynos performs well in integer operations and web-serving workloads (wordpress and nginx benchmarks). Heavy memory-intensive applications (sdag/sdagp, matrix and blosc benchmarks) perform better on AMD FX.
B. Lessons Learned
During the experiments we faced a number of technical problems. In the following we list the most relevant of them.
1) The HPC implementations vary across platforms. Not only the number of available events differs across platforms, but also their meaning. We checked the Linux sources and developer manuals to ensure that our interpretation is correct.
2) We observed that VMs may briefly migrate to another CPU even if they are "pinned" to specific CPU cores. These cases are rare and do not change the overall picture.
3) Care should be taken when a large number of events is enabled. The number of available events exceeds the number of counting registers by a factor of 5 to 10. If too many events are enabled simultaneously, the operating system has to perform time multiplexing, which leads to a loss of precision.
4) Drivers and I/O can significantly affect the performance. We observed up to 40% deviation in instructions per second on heavy benchmarks if the system flushes disk caches. This does not affect the long-term average performance, but becomes critical for periodic measurements.
V. SUMMARY AND FUTURE WORK
In this paper we proposed a novel methodology for predicting the performance of concurrently executed cloud applications by analyzing hardware performance counters collected during their standalone execution. This becomes especially useful for resource allocation and scheduling in large-scale computing systems that process incoming requests on demand. Future work will focus on evaluating the proposed methodology on a larger number of hardware platforms, exploring trace points as an alternative method for profiling applications, and developing a CPU scheduler based on the designed methodology.
VI. ACKNOWLEDGMENT

