Techniques for predicting the efficiency of multi-core processing associated with a set of tasks with varied CPU and main memory requirements are introduced. Prediction of CPU and memory availability is important in the context of making process assignment, load balancing, and scheduling decisions in distributed systems. Given a set of tasks, each with varied CPU and main memory requirements, and a multi-core system (which generally has fewer cores than the number of tasks), we provide upper- and lower-bound models (formulas) for the efficiency with which the tasks are executed. In addition, a model for average CPU availability, derived from the empirical study, is introduced for applications that require a single predicted value instead of bounds. To facilitate scientific and controlled empirical evaluation, real-world benchmark programs with dynamic behaviour (CPU and memory requirements that change within a short interval of time) are employed on UNIX systems; the programs are parameterised by their CPU usage factor and memory requirement.
Introduction
The success of approaches for assigning threads (or processes) to multi-core systems relies on the existence of reasonably accurate models for estimating resource (CPU and memory) availability. System efficiency prediction involves estimating the system's behaviour for the set of tasks to be executed on it. Making such predictions is complicated owing to the dynamic nature of the system and its workload, which can vary drastically in a short span of time (Beltrán et al., 2008) . For any prediction approach, it can be useful to know a priori certain characteristics of tasks that are planned to be assigned to the system. As an example, it is useful to know the maximum amount of main memory a task will consume at any point during its execution (memory requirement). This information is useful to forecast the amount of time that may be consumed for memory paging activities. It is also useful to know the CPU requirement of the task, which is the fraction of time a task requires the CPU.
While a priori information such as memory and CPU requirements of tasks is useful, it is easier to obtain in some cases and more difficult in others. For example, in the case of Merge sort we can approximately determine the memory requirement based on the number of elements to be sorted. Tasks such as generating prime numbers, computing Fast Fourier Transforms, and related others require significant use of the CPU. One can also know the CPU and memory requirements of a task based on the information gathered from its earlier executions (Aioanei, 2011). The execution of many scientific models, such as economic, meteorological, and geographical models, falls into this category, where the program remains the same and the data on which it operates changes over time.
Given the CPU requirements of tasks in a run queue, Beltrán et al. (2008) provide an analytical model to estimate the CPU availability (which is the percentage of CPU time that will be allocated) for the new task prior to its placement in the run queue. It is shown that in certain cases this information can be used to schedule the execution of tasks in such a way that the completion time of all the tasks is minimised. The work of Beltrán et al. (2008) has been extended here as follows: for a batch of tasks (each with its own CPU requirements), an analytical model is introduced to determine the CPU availability using the sum total of the CPU requirements of the tasks in the batch. Using this sum total posed a challenge because the CPU availability prediction is precise only when the order of task execution is known a priori. To address this challenge, the analytical model introduced in this paper provides upper- and lower-bounds on CPU availability. These bounds are necessary since the actual CPU availability depends on the order of execution of tasks in the batch. Thus, the analytical model in Hasan et al. (2014) is oblivious to the CPU scheduler.
When a processor accesses memory, it spends a significant amount of time waiting for the data to become available because of cache misses, which may result in up to 50% of stall time. This situation generates huge overhead when the frequency of memory access increases. To overcome this situation, most of the recent hardware designs have implemented multi-threaded processor cores in which two or more hardware threads are assigned to each core. That way, if one thread stalls while waiting for memory, the core can switch to another thread (Avi et al., 2010). To maintain its own architectural state, each core has its independent register set and thus appears to the operating system to be a separate physical processor. From an operating system perspective, each hardware thread (when hyper-threading is enabled) appears as a logical processor that is available to run a software thread. Thus, on a dual-threaded, dual-core system, four logical processors are presented to the operating system. We have incorporated the effect of hyper-threading in our new CPU availability and memory models for accuracy.
Further improvements to Hasan et al. (2014) are presented in the present paper; these are summarised as follows:
• First, a new CPU availability lower-bound model is introduced which provides a significantly tighter bound compared with the previous lower-bound model.
• Second, a new average CPU availability model is introduced, which is derived from the empirical analysis of the CPU availability model. This model can predict the average CPU availability for a set of tasks before placing them into the run-queue. The model provides a single CPU availability prediction value (instead of upper- and lower-bounds) for the set of tasks without explicit knowledge of the mapping between available cores and tasks.
• Third, the CPU availability model (consisting of the new lower-bound) is combined non-trivially with the memory model to derive a composite model. This model consists of analytically derived upper- and lower-bounds for predicting overall execution efficiency. Given a set of tasks (with known CPU and memory requirements) and a set of compute nodes, we can use the composite model to determine the best node (in terms of thread execution efficiency) and assign each task to the most efficient node to achieve the fastest processing.
It is necessary to have accurate models for estimating CPU and memory resources because of the dynamic nature of computer systems and their workload. A number of scenarios with real-world benchmark programs containing dynamic behaviour (regarding CPU and memory requirements) are carried out to measure the accuracy of CPU and memory prediction models. Confidence interval and moving average statistics from measured CPU and memory availability validate the utility of these models. In addition, an extensive empirical study is included to measure the accuracy of the proposed composite prediction and the average CPU prediction models as well.
The rest of this paper is organised in the following manner. Section 2 discusses relevant background related to the CPU availability and memory models and motivates the importance of predicting resource availability. Section 3 presents the theoretical derivations of upper- and lower-bounds of the CPU availability model, case study measurements for single- and multi-core CPUs, statistical analysis of the results, and the new average prediction model. The theoretical derivations of upper- and lower-bounds for memory availability are introduced in Section 4. This section also includes the empirical studies, including benchmarking, case study measurements for multi-core CPUs, and statistical analysis of the results. In Section 5, we introduce the composite prediction model and derive its upper- and lower-bounds from the CPU availability and memory models, along with empirical studies. Finally, Section 6 contains concluding remarks, application areas of the introduced prediction models, and suggestions for future work.
Relevant work
Existing prediction models generally assume CPU resources are equally distributed among all processes in the run queue by following a Round Robin (RR) scheduling technique (Beltrán et al., 2008; Avi et al., 2010). These models use the number of processes in the run queue as the system load index. As a result, the CPU availability prediction for a newly arriving process when there are currently n processes in the run queue is simply 1/(n + 1). This predictor is only accurate for CPU-bound processes, which share CPU resources in a balanced manner, consistent with the RR model assumption. But, when the processes also require I/O resources, this approach fails to provide accurate predictions and incurs large prediction errors. Thus, when there are processes in the run queue that require CPU and I/O resources, a more complex model is necessary to describe how the CPU is shared (Beltrán and Guzmán, 2009). The introduced models overcome this limitation and are suitable for both CPU and I/O bound processes. Fedorova et al. (2007) also worked on operating system scheduling on heterogeneous core systems. They proposed thread-to-core assignment algorithms that optimise performance and demonstrate the need for balanced core assignment. Their paper makes the case that thread schedulers for multi-core systems in a heterogeneous environment should target the following objectives: optimal performance, core assignment balance, response time, and fairness. In addition, Dwyer et al. (2012) introduced a practical new method for estimating performance degradation on multi-core processors, and its application to workloads of cluster nodes. Yeh et al. (2014) have worked on multi-core systems and the allied software parallelisation trends in system-on-chip (SoC) design.
Their paper adopts an electronic system-level (ESL) design methodology, aimed at higher system performance and lower energy utilisation, to present a system performance prediction and analysis method for multi-core systems. Based on a scalable multi-core virtual platform, they have performed one- to eight-core system performance trend prediction as well as multi-core system performance analysis. Experiments are conducted for observing the performance improvement of software parallelisation. Their paper also discusses hardware-software co-design and hardware cost reduction. Zhang et al. (2014) proposed two fault-tolerant scheduling methods on multiprocessor systems through active and passive backup copies. The proposed model uses integer linear programming and a heuristic algorithm for obtaining optimal results (polynomial run-time). The empirical tests conducted in that paper evaluate the proposed methods in terms of scheduling length for a set of DAG benchmarks and show the effectiveness of the proposed methods.
CPU availability model
The primary focus of this section is to present a CPU availability prediction model for estimating CPU availability for a set of tasks on multi-core systems. When the number of threads assigned to a multi-core processor is less than or equal to the number of CPU cores associated with the processor, the efficiency of the CPU is often near to ideal. When the number of assigned threads is more than the number of CPU cores, the resulting CPU performance can be more difficult to predict (Hasan et al., 2014) . For example, assigning two CPU-bound threads to a single core results in CPU availability of about 50%, meaning that roughly 50% of the CPU resource is available for executing either thread. Alternatively, if two I/O-bound threads are assigned to a single core, it is possible that the resulting CPU availability is nearly 100%, provided that the usage of the CPU resource by each thread is fortuitously interleaved. However, if the points in time where both I/O bound threads do require the CPU resource overlap (i.e., they are not interleaved), then it is possible (although perhaps not likely) that the CPU availability of the two I/O bound threads could be as low as 50%. Therefore, considering the number of processes in the run queue itself as a load index is not sufficient. The aggregate CPU load (sum total) of running threads needs to be considered for deriving an accurate model.
In the framework described here, the execution of a thread is modelled by a series of alternating work and sleep phases. A thread in a work phase will remain there until it has consumed enough CPU cycles to complete the allotted work of that portion of computation. After completing the work phase, the thread enters the sleep portion, where it stays (without consuming CPU cycles) for a specific amount of time. Each thread is parameterised by a CPU usage factor (Hasan, 2011). The CPU usage factor is defined as the time required to complete the work portion of a work-sleep phase on an unloaded CPU, divided by the total time of a work-sleep phase. A thread having zero sleep length has a CPU usage factor of 100% and is called a CPU-bound thread. The sleep phase length and the total amount of computational work to accomplish are computed from the CPU usage factor. The sleep length, set relative to the length of the work portion according to the CPU usage factor, is used to account for I/O or any other interruptions (Hasan et al., 2014).
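As a minimal sketch of the work-sleep parameterisation described above, the following Python snippet computes a CPU usage factor from phase lengths and, conversely, the sleep length implied by a target usage factor. The function names are illustrative and are not taken from the original framework; the inverse relation is simply the definition solved for the sleep time.

```python
def cpu_usage_factor(work_time, sleep_time):
    """Work time on an unloaded CPU divided by the total work-sleep phase time."""
    if work_time <= 0 or sleep_time < 0:
        raise ValueError("work_time must be positive and sleep_time non-negative")
    return work_time / (work_time + sleep_time)


def implied_sleep_time(work_time, usage_factor):
    """Sleep length implied by the definition: u = w / (w + s)  =>  s = w * (1 - u) / u."""
    if not 0.0 < usage_factor <= 1.0:
        raise ValueError("usage_factor must lie in (0, 1]")
    return work_time * (1.0 - usage_factor) / usage_factor
```

A CPU-bound thread (zero sleep) yields a usage factor of 1.0, and a thread that works for one time unit and sleeps for three yields 0.25.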
The results presented in Hasan et al. (2011) show that n running threads having similar work-sleep phases often result in 1/n CPU availability in a single-core machine. In such cases, the work phases get n times longer due to CPU contention among running threads, depicted in Figure 1(a) .
Alternatively, if the work portions of threads are staggered (interleaved) to where there is no overlap, by phase shifting, then there is no contention for the CPU resource and the CPU availability is essentially 100%, as shown in Figure 1 (b). The ideal phasing illustrated in Figure 1 (b) requires that the work of any two threads can be accomplished within the time of one sleep portion of a phase. 
CPU availability model upper- and lower-bounds
For a single-core machine, the following model defines the upper-bound for CPU availability. Here, L is the aggregate loading factor, defined as the sum of CPU usage factors of all threads in a batch.

$$\text{upper-bound} = \begin{cases} 1, & L \le 1 \\ 1/L, & L > 1 \end{cases} \tag{1}$$
The upper-bound model represents the best case CPU availability, which is realised by the example of Figure 1 (b) in which none of the threads use the CPU resource concurrently. Provided that the sum of the usage factors of the threads is less than unity, then it is possible that the CPU availability could be as high as unity (i.e., 100%). When the sum of the CPU usage factors is greater than unity, then the best possible value for CPU availability is 1/L. For a single core machine, the following model defines the lower-bound for CPU availability. Here, n is the number of threads assigned to the single core machine.
The lower-bound model is associated with a situation in which the threads' usage of the CPU resource has maximum overlap, as depicted in Figure 1(a), where all threads of the batch use the CPU resource concurrently. The execution overhead due to work phase overlap increases when the number of concurrent threads in a batch increases or when the aggregate CPU load among concurrent threads increases. This additional overhead is estimated by equation (2), which relies on both the number of concurrent threads and the aggregate CPU load of the batch. For a multi-core machine with r cores, the following models define upper- and lower-bounds for CPU availability. The upper-bound model is:

$$\text{upper-bound} = \begin{cases} 1, & L \le r \\ r/L, & L > r \end{cases} \tag{3}$$
and the lower bound is:
Note that equations (3) and (4) are generalisations of equations (1) and (2), i.e., for the case r = 1, equations (3) and (4) are identical to equations (1) and (2). The difference between upper- and lower-bounds can be significant for moderate values of aggregate loading. The difference for a multi-core machine can be as high as 0.446 for eight threads with aggregate CPU loading of 0.50. To address this challenge, based on the possibility of having a tighter lower-bound, a sharpened CPU availability lower-bound model for multi-core systems is introduced below. The sharpened lower-bound model is derived from the previously introduced lower-bound model in this section.
As described earlier, the lower-bound model always considers the worst case staggering of threads, which leads to intense performance degradation due to contention among all running threads. But, it has been observed from the empirical results that when the threads are lightly loaded (i.e., small L), the probability of this worst case staggering is low. That is, three threads in a batch having an aggregate CPU load of 0.3 have a lower probability of worst case staggering compared with three threads with an aggregate CPU load of 0.9.
The new lower-bound model consists of two scenarios. In the first scenario, the aggregate CPU load value (L) is less than or equal to the number of CPU cores (r). Threads are lightly loaded in this scenario and the probability of worst case staggering is small. The following estimated value, denoted by μ, is subtracted from the denominator of equation (4) for the case L ≤ r to reduce the penalty. A new parameter in this model is the degree of core hyper-threading, denoted by ξ (cores can have two or more hardware threads).
In the second scenario, the value of L > r. The CPU load of threads in the batch increases as the value of L increases. As the threads' CPU requirement increases, the probability of worst case staggering increases. When the value of L equals n (i.e., threads become CPU bound), the worst case staggering becomes inevitable and the following equation goes to 0. The estimated value, denoted by μ, is subtracted from the denominator of equation (4) for the case L > r to reduce the penalty when
That is, based on equation (6), when the value of L = n, the worst case staggering is certain and nothing will be subtracted from the denominator (the sharpened lower-bound model value will be the same as the previous lower-bound model). By subtracting the above equations from the introduced lower-bound model, we get the following sharpened lower-bound model.
Figure 2 plots the upper-bound model and both lower-bound models for the case r = 4 and n = 16. In this figure, the horizontal axis represents the aggregate CPU load of the running threads (on a scale of 0 to n) and the vertical axis represents CPU availability (on a scale of 0.0 to 1.0). In an unloaded system, when the aggregate CPU load for running threads is below or equal to the number of CPU cores, the upper-bound availability is 1.0 (representing 100%) because every thread gets sufficient CPU resources. When the aggregate CPU load rises above the number of CPU cores, CPU availability to threads decreases, resulting in efficiency degradation. Note that the difference between the upper- and lower-bounds becomes significantly tighter for moderate values of aggregate loading with the new lower-bound model. In the following section, experimental studies are performed to determine actual measured values of CPU availability in relation to these bounds.
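The upper-bound behaviour described for Figure 2 can be illustrated with a short Python sketch. It encodes only what the prose states (availability of 1.0 while the aggregate load fits on the cores, and r/L once it exceeds them); the function name is an assumption made for illustration, and the lower-bound models are not sketched here.

```python
def upper_bound_availability(aggregate_load, cores=1):
    """Best-case CPU availability: 1.0 while the load fits on the cores, cores/load beyond."""
    if aggregate_load <= 0 or cores < 1:
        raise ValueError("aggregate_load must be positive and cores at least 1")
    return min(1.0, cores / aggregate_load)


# Sample points for the r = 4, n = 16 setting of Figure 2.
for load in (2.0, 4.0, 8.0, 16.0):
    print(load, upper_bound_availability(load, cores=4))
# prints 1.0, 1.0, 0.5 and 0.25 for the four loads
```

With `cores=1` the same function gives the single-core case, which matches the remark that the multi-core bound generalises the single-core one.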
Experimental studies

Overview
The purpose of the experimental studies is to empirically measure the CPU availability as a function of aggregate loading. For the empirical study, the real-world benchmark programs presented in Table 1 are developed and utilised. The Monte Carlo π estimation, fast Fourier transformation, and Super prime number generator are compute-intensive benchmark programs, and are thus suitable for validating the CPU availability model. The measured CPU availability associated with a collection of threads executing on a processor is defined as the ratio of the ideal time required to execute a benchmark thread (i.e., piEstimation, fft, or supPrime) on an unloaded processor to that thread's execution time on a loaded processor. About 2,000 independent test cases are conducted using these compute-intensive benchmark programs to cover all possible situations of the CPU model. A test run of each batch of threads provides one measurement of CPU availability (the minimum efficiency among the threads is taken for accuracy). Algorithm 1 presents the major parts of the experimental system. To ensure a uniform sampling of data across the values of possible aggregate loadings, a random value of aggregate loading between n × ε and n is chosen first. The value ε = 0.05, denoting a 5% CPU load, is used to represent the extreme lower CPU load value for a thread. A thread cannot have a CPU load value of 0.0, else it would never complete its assigned work. The selected aggregate load (sum total) is then distributed among threads using the expressions inside the inner for loop. For example, if a thread batch contains eight threads, then the min-limit is 0.05 × 8 = 0.4 and the max-limit is 8.0 (i.e., all CPU-bound threads). A random CPU load value between 0.4 and 8.0 is then chosen and distributed among the eight threads using Algorithm 1.
Algorithm 1 Aggregate load distribution and measurement of execution efficiency
Input: Number of threads (n), and number of runs (Γ)
Compute sleep phase length for each thread [equation (8)]
Compute total work amount for each thread [equation (9)]

The expressions U_i and L_i are introduced to provide an upper- and lower-limit of available CPU load for the i-th thread. A random CPU load value is selected from the range (L_i, U_i) and assigned to the i-th thread. The CPU load for the thread is then scaled and placed into T_i. For example, for a scenario having two threads, a value of aggregate loading is chosen between a small value (2 × ε) and 2.0; denote this value as L. Then a random value is chosen between max{ε, (L - 1.0)} and min{1.0, L}, which defines the CPU usage factor of the first thread, say T_1; the CPU usage factor of the second thread is then defined as T_2 = L - T_1. In general, for n threads, Algorithm 1 is used to randomly assign the CPU usage factors for a given value of aggregate loading L. Based on the CPU load of each thread, the total amount of work (upper range) and the sleep phase length are derived. During each work phase, threads accomplish a fixed amount of work. Threads need to run several phases to complete the total amount of work. As described in Algorithm 1, a phase shift value is also assigned to each thread, ranging from 0 to the length of the phase, which provides a degree of staggering of phases among the threads.
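The load distribution step can be sketched in Python as follows. This is a hedged generalisation of the two-thread example to n threads, not a transcription of Algorithm 1: the names `EPSILON` and `distribute_load` are illustrative, and the upper limit is taken as the remaining load minus ε for every thread still to be assigned (an assumption, slightly stricter than min{1.0, L}) so that no later thread is left with less than ε.

```python
import random

EPSILON = 0.05  # smallest CPU load a single thread may receive (value from the text)


def distribute_load(aggregate_load, n, rng=random):
    """Split aggregate_load into n CPU usage factors, each within [EPSILON, 1.0]."""
    if not n * EPSILON <= aggregate_load <= n:
        raise ValueError("aggregate_load must lie in [n * EPSILON, n]")
    factors = []
    remaining = aggregate_load
    for i in range(n - 1):
        others = n - i - 1  # threads still to be assigned after this one
        # Lower limit: what must be taken now so the others need at most 1.0 each.
        lower = max(EPSILON, remaining - others * 1.0)
        # Upper limit: leave at least EPSILON for each of the others (assumption).
        upper = min(1.0, remaining - others * EPSILON)
        t = rng.uniform(lower, upper)
        factors.append(t)
        remaining -= t
    factors.append(remaining)  # the last thread takes what is left
    return factors
```

For two threads this reduces to drawing T_1 and setting T_2 = L - T_1, as in the example above. Passing a seeded `random.Random` instance as `rng` makes a run reproducible.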
Experimental environment
The system used for evaluating the single-core test cases is equipped with an Intel Xeon E5540 processor, 2.53 GHz clock speed, 1,333 MHz bus speed, and 4 GB of RAM. This machine has Linux kernel version 3.2.0-36. The system used for evaluating the multi-core test cases is equipped with an Intel(R) Xeon(R) Quad-core W3520 processor, 2.67 GHz clock speed, 1,333 MHz bus speed, and 6.0 GB of RAM. This machine also has Linux kernel version 3.2.0-36. The average CPU load (representing the average system load over a period of time) was 0.0161302 per core on a scale of 1.0 (over a 15 minute period) before running the test cases, which indicates that the machines were lightly loaded (essentially unloaded). Owing to the different configurations of the single- and quad-core systems, benchmarking for the single-core and quad-core machines was performed separately. The C programming language in a Linux environment is used to implement the analytical model framework and benchmark programs. The gcc compiler version used is 4.6.3. Threads deployed here are independent tasks, meaning there are no interdependencies among threads such as message passing. Threads are spawned concurrently on a single- and a multi-core machine with a measured work load. When the batch of threads completes its assigned work and terminates, an execution report is produced, which contains the start time, work time, idle time, end time, number of phases, and other values for statistical analysis.
Benchmarking
Benchmarking a thread on an unloaded system enables the calibration of parameters associated with the work and sleep portions of the phases to synthesise a particular CPU usage factor. For setting up the benchmarks in Table 1 , each thread is assigned a total work and sleep phase duration depending on the assigned CPU load value discussed in Algorithm 1. In the work portion of each phase, threads accomplish a fixed amount of CPU intensive work for generating prime numbers (based on the benchmark program). Depending on the work phase completion time (which is mostly constant for a machine) and CPU load, the sleep phase length of the thread is calculated using the following equation:
According to equation (8), the sleep time increases as the CPU usage decreases. Thus, a thread with low CPU usage will sleep longer than other threads having higher CPU usage factors. Although the work portion length is the same for all threads, the sleep phase length varies with CPU load, so the total phase length (work + sleep) differs among threads. The total amount of work that needs to be accomplished by each thread is defined by equation (9). In a given time frame, say 10 seconds, a thread with a higher CPU load can accomplish more work than a thread with a lower CPU load. Thus, it is necessary to calculate a variable amount of work, denoted by ω, for each thread so that all threads finish their assigned work and terminate at the same time.
Parameters of equation (9) are the maximum phase time (work + sleep phase time) among all threads, phase time of the thread, number of stages to complete the ideal work, and amount of accomplished work in each phase (SW). Measured total workloads are then assigned to respective benchmark threads so that the batch terminates at the same time for accuracy.
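One plausible reading of the parameters listed above can be sketched in Python. The form used here is an assumption rather than the paper's equation (9): the thread with the longest phase runs the given number of stages, and every other thread is assigned proportionally more work (in the ratio of the maximum phase time to its own phase time) so that the batch terminates together. All function and parameter names are illustrative.

```python
def phase_time(work_time, usage_factor):
    """Total (work + sleep) phase length implied by a CPU usage factor."""
    if work_time <= 0 or not 0.0 < usage_factor <= 1.0:
        raise ValueError("work_time must be positive and usage_factor in (0, 1]")
    return work_time / usage_factor


def total_work(usage_factors, work_time, stages, work_per_phase):
    """Work assigned to each thread so that all threads finish together (assumed form)."""
    phases = [phase_time(work_time, u) for u in usage_factors]
    longest = max(phases)
    # The thread with the longest phase runs `stages` phases; a thread with a
    # shorter phase fits proportionally more phases into the same time span.
    return [longest / p * stages * work_per_phase for p in phases]
```

For usage factors 0.25, 0.5 and 1.0 with a unit work time, 10 stages and 100 units of work per phase, the phase times are 4, 2 and 1, and the assigned work is 1000, 2000 and 4000 units respectively.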
Empirical CPU availability case studies
For measuring the CPU availability of the single-core processor, three case studies were conducted in which multiple (2, 3, and 4) threads were spawned concurrently. An aggregate CPU load L was selected randomly and distributed among threads as described in Algorithm 1. Figures 3 and 4 show measured CPU availability scatter graphs for two and four concurrent threads executing on the single-core processor, superimposed with the plots of the upper- and lower-bound models derived in Section 3.1. In these figures, the horizontal axis represents aggregate CPU load and the vertical axis represents CPU availability. Each small dot in these graphs is an independent test case measurement of CPU availability. There are 2,000 dots in each figure, each representing a measured CPU availability value among the concurrent threads. A moving average line is also drawn through the data on the graphs to help visualise the average measured performance. A window size of 0.10 aggregate CPU load and an incremental value of 0.01 were used to calculate the moving average values. A similar sliding window approach was employed to calculate the 90% confidence interval upper- and lower-limits. It is apparent from Figures 3 and 4 that the variation in CPU availability is low when the aggregate CPU loading is either relatively low or relatively high. Thus, CPU availability prediction is quite accurate when the CPU is either lightly or heavily loaded. When the total CPU loading is moderate, the measured CPU availability has more variation and thus it is less predictable. One intuitive reason for the low variation when threads have small CPU loading factors is that their sleep phases are longer, which decreases the probability of work portion overlap among the threads. On the other hand, when the aggregate CPU load is large, threads have shorter sleep phases and longer work phases, which almost always forces work phases to overlap and decreases performance.
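The moving average described above can be sketched in Python as follows. The window size of 0.10 and step of 0.01 come from the text; the exact window placement (half-open windows starting at the smallest observed load) and the function name are assumptions. The confidence interval computation is omitted because the text does not state how the limits were obtained.

```python
def moving_average(points, window=0.10, step=0.01):
    """Sliding-window mean of CPU availability over aggregate CPU load.

    points: iterable of (aggregate_load, availability) pairs.
    Returns (window_centre, mean_availability) pairs for non-empty windows.
    """
    pts = sorted(points)
    if not pts:
        return []
    lowest, highest = pts[0][0], pts[-1][0]
    result = []
    k = 0
    while True:
        start = lowest + k * step  # recomputed from k to limit rounding drift
        if start > highest:
            break
        inside = [a for (x, a) in pts if start <= x < start + window]
        if inside:
            result.append((start + window / 2, sum(inside) / len(inside)))
        k += 1
    return result
```

Each returned pair corresponds to one point of the moving average line drawn through the scatter data.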
For measuring the CPU availability of the quad-core processor, three case studies were conducted in which multiple (8, 12, and 16) threads were spawned concurrently, so that performance can be compared between the single- and quad-core processors. A similar approach has been employed to calculate the moving average and 90% confidence interval values. CPU availability graphs for eight and 16 threads are shown in Figure 5 and Figure 6, respectively. Figure 5 shows a lower variation of CPU availability compared with Figure 3, which shows the CPU availability of two threads on a single-core machine. This decrease in variation may be explained as follows: during the thread execution life cycle, depending on CPU availability, threads might be allocated to different cores for load balancing, which could decrease the probability of work phase overlap.
The empirical results for the quad-core processor also show that the CPU availability prediction is quite accurate when the CPU is either lightly or heavily loaded. When the total CPU loading is moderate, the measured CPU availability has more variation and thus it is less predictable.
In further reporting the results of the studies, it is convenient to define the normalised aggregate load, L/n, which is the aggregate load L normalised by the number of threads n. For sample values of normalised aggregate load, Table 2 shows the average measured CPU availability (Avg.), the difference between the upper- and lower-bound models (BD), and the difference between the 90% confidence interval limits (CI Diff) for two, three, and four concurrent threads on the single-core processor. Table 2 shows that the difference between the upper- and lower-bounds can reach as high as 0.375, for four threads and a normalised aggregate loading of 0.20. However, the measured 90% confidence interval difference for the same case is much smaller, around 0.105. The formula-based bounds are tighter (the difference between them is smaller) when the CPU is lightly or heavily loaded. Similarly, for a quad-core processor, Table 3 shows that the difference between the upper- and lower-bounds can reach as high as 0.273, for eight threads and a normalised aggregate loading of 0.50. However, the measured 90% confidence interval difference for that case is much smaller, around 0.113. The formula-based bounds are likewise tighter when the CPU is lightly or heavily loaded for the quad-core processor. Additionally, the empirically-based values for CI Diff can be used as a basis for creating sharper estimates of CPU availability.
Average CPU availability prediction
This section introduces a model for predicting the expected (average) availability of the CPU (instead of calculating the upper- and lower-bounds). As illustrated in the previous subsection, exact values of CPU availability are difficult to predict because of dependencies on many factors, including context switching overhead, memory speed, CPU usage requirements of the threads, core hyper-threading, the degree of interleaving of the timing of the CPU requirements of the threads, and the characteristics of the thread scheduler of the underlying operating system. Due to the complex nature of the execution environment, an empirical approach is employed here to estimate expected CPU availability.
The model
Based on the observed shape and values of measured average CPU availability in Figures 5 and 6 , and Table 3 of Subsection 3.2.4, the mathematical model for average CPU availability is derived. It can be observed from Figures 5 and 6 that the shape of measured average CPU availability has similarities with the upper bound model [derived in equation (3)].
Because of the similarities with the upper-bound model, the average CPU availability model is also derived based on two scenarios. In the first scenario, when the value of aggregate CPU load (L) is less than or equal to the number of CPU cores (r), that is, when the running threads have low CPU utilisation, the likelihood of work phase contention is minimal. This results in fewer contending work phases among running threads, but still some context switching overhead. The CPU availability upper-bound model is based on the situation in which all work portions of threads are interleaved, but context switching overhead is ignored. The following model represents the context switching overhead when L ≤ r. This estimated context switching overhead is subtracted from the first scenario of the upper-bound model to reflect the observed behaviour.
In the second scenario, when the value of L > r, the best available CPU for running threads is considered as r/L by the upper-bound model. This scenario is based on interleaved work phase of running threads. Though this model provides a good estimation but when the value of L increases, threads become more CPU hungry and the contention is more likely due to relatively wider work portions than idle portions. That is, more CPU load increases more CPU contention and more context switching overhead. The following estimated context switching overhead is subtracted from the second scenario of the upper-bound model to reflect the observed behaviour.
Equation (12) provides a model of what has been observed, on average, for thread assignment on a processor in Subsection 3.2.4 (and in Hasan et al., 2014). The model of equation (12) depends on the aggregate CPU load (sum total) of the set of tasks, the number of threads, the number of processor cores, and the number of hyper-threads per core, denoted by ξ. For a multi-core machine, the following prediction model estimates the average CPU availability for a set of tasks:
The model of equation (12) is derived from equations (10) and (11) and depends on whether the aggregate CPU load is less than or more than the number of cores. In the first situation, CPU resources are lightly loaded, resulting in less context switching overhead and better efficiency. In the second situation, threads are moderately to highly loaded (i.e., aggregate CPU load is more than the number of cores), resulting in more context switching overhead and reduced efficiency. The usage of CPU resources by threads has maximum overlap, as depicted in Figure 1(a). In general, a thread with more CPU load incurs more contention for resources and more context switching overhead. Therefore, an estimated context switching overhead is subtracted from the efficiency value in both cases (i.e., for L ≤ r and for L > r) to best fit the average efficiency plot.
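The two-scenario structure described above can be illustrated with the upper-bound baseline alone. The following Python sketch is not the average model of equation (12): it omits the context switching overhead terms of equations (10) and (11), and it assumes that the upper bound equals 1.0 when L ≤ r (all work phases perfectly interleaved), which the text implies but does not state explicitly. The function name is hypothetical.

```python
def upper_bound_availability(aggregate_load: float, cores: int) -> float:
    """Upper-bound CPU availability baseline for aggregate load L on r cores.

    Scenario 1 (L <= r): assumed to be 1.0, i.e. every thread gets the CPU
    whenever it needs it (assumption, not quoted from the paper).
    Scenario 2 (L > r): r / L, as stated in the text.
    """
    if aggregate_load <= 0 or cores <= 0:
        raise ValueError("aggregate_load and cores must be positive")
    if aggregate_load <= cores:
        return 1.0
    return cores / aggregate_load


if __name__ == "__main__":
    # Quad-core machine, lightly and heavily loaded batches.
    for load in (2.0, 4.0, 8.0):
        print(load, upper_bound_availability(load, cores=4))
```

The average model would subtract an estimated context switching overhead from each branch; those overhead terms are deliberately left out here because their form is not given in this section.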
Average CPU availability model verification
As the average CPU availability model is derived from Table 3 in Section 3.2.4, which presents the multi-core CPU availability empirical results, a new set of empirical tests was conducted to validate the accuracy of the model. Figures 7(a), 7(b) and 7(c) correspond to three new independent sets of experiments for 8, 12, and 16 concurrent thread batches on a quad-core machine. In these figures, the horizontal axis represents the normalised aggregate load, L/n, and the vertical axis represents the CPU availability for running threads. Each empirical study consists of 2,000 independent test runs like those shown in Figures 5 and 6.
An approach similar to that explained in Section 3.2.4 is also applied here to compute the running average value from the dataset. A window size of 0.10 aggregate CPU load and an increment of 0.01 were used to calculate the moving average values that form the plotted line. It can be observed that the CPU availability model lines are smooth and follow a pattern similar to that of the average lines obtained empirically. As the number of threads in a batch increases, the shape and pattern of the prediction model line become almost identical to those of the average measured line. Table 4 lists the normalised aggregate load, L/n, the average measured CPU availability (Avg.), the prediction model value (Model), and the prediction error in percent (Error). Table 4 contains the new empirical data used to validate the model. It can be observed from the table that for a set of eight threads, the maximum prediction error is 3.24% when the normalised aggregate loading is 0.5. For 12 threads, the maximum prediction error is 3.63% when the normalised aggregate loading is 0.4. Finally, for 16 threads, the maximum error is 1.92% when the normalised aggregate loading is 0.5. This analysis shows that the introduced average CPU availability model is consistent with the new set of data. Although the model was created from one set of data, it accurately predicts the average CPU availability for a new and independent set of data, with a maximum prediction error of only 3.63%. Thus, the model predicts availability accurately and reliably and can be deployed in real-world applications for scheduling.
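As a rough illustration of how the averaged line and the error column of Table 4 could be computed, the following Python sketch applies a sliding window over (load, availability) samples and reports a percentage error. The function names, the half-open window convention, and the choice of the measured average as the error denominator are assumptions, not details taken from the study.

```python
def sliding_average(samples, window=0.10, step=0.01, start=0.0, stop=1.0):
    """Average the y values of (x, y) samples inside a sliding window.

    Each window covers [lo, lo + window); windows advance by `step`.
    Returns (window_centre, mean_y) pairs for windows that contain data.
    """
    points = []
    window_count = int(round((stop - start - window) / step)) + 1
    for k in range(window_count):
        lo = start + k * step
        hi = lo + window
        ys = [y for x, y in samples if lo <= x < hi]
        if ys:
            points.append((lo + window / 2, sum(ys) / len(ys)))
    return points


def prediction_error_pct(measured: float, predicted: float) -> float:
    """Absolute prediction error relative to the measured value, in percent."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return abs(predicted - measured) / abs(measured) * 100.0


if __name__ == "__main__":
    demo = [(0.02, 1.0), (0.06, 0.5), (0.17, 0.25)]  # made-up sample points
    print(sliding_average(demo, window=0.10, step=0.05, start=0.0, stop=0.2))
    print(prediction_error_pct(measured=0.5, predicted=0.52))
```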
Outcome
Section 3 has introduced two prediction models:
a given a set of tasks, the aggregate CPU load of the set of tasks, and the number of CPU cores of a machine, we can predict the upper- and lower-bounds of CPU availability for the set of tasks
b given a set of tasks, the aggregate CPU load, the number of CPU cores, and the hyper-threading of a machine, we can predict the average CPU availability for the set of tasks.
An extensive set of empirical studies has been conducted for validating the prediction models in single- and multi-core architectures. Thread availability scatter plots provide a clear visualisation of measured performance based on the density of the dots in the plots. These empirically measured availability values are shown to generally fall within the theoretically derived upper- and lower-bound models. A 90% confidence interval for measured availability is shown to provide tighter upper- and lower-limits than the theoretically derived models for upper- and lower-bounds. The bounds are necessary since the actual CPU availability changes based on the order of execution of tasks in the batch. Additionally, the prediction model for upper- and lower-bounds is consistent with the dynamic nature of the system and its workload, which can vary drastically in a short amount of time.
The empirical studies performed for validating the average CPU availability model show that the model values follow the same shape and pattern as the experimentally measured average CPU availability lines. This model is suitable for applications that require a single prediction value instead of upper- and lower-bounds for dispatching tasks. This prediction model could be used as part of a scheduling module to determine the order in which tasks should be assigned to the system so that the completion time of all the tasks is minimised.
Memory availability model
In addition to CPU availability, the success of approaches for assigning threads (or processes) to multi-core systems relies on the existence of a reasonably accurate memory model for estimating the execution efficiency of tasks due to memory availability. This is because there is a strong relationship between a thread's total execution time and the availability of memory resources used for its execution. "Predicting the execution efficiency of tasks when the available memory is less than required memory is critical in making task assignment and scheduling decisions" (Beltrán et al., 2008; Tanenbaum, 2008) .
It is customary to keep several processes running in time-shared systems. For a process to execute, it is necessary for the system to swap in the required portion of code and data of the process into the primary memory. Additional pages are loaded only when they are demanded during program execution. As the degree of parallelism increases, over-allocating memory may result in severe performance degradation. Demand paging saves the I/O necessary to load pages that are never used but increases the probability of memory overrun. If one access out of 1,000 causes a page fault, the system can be slowed down by a factor as high as 40 because of the page fault service time. The effective access time is directly proportional to the page-fault rate (Avi et al., 2010; Tanenbaum, 2008).

Table 5 Terms and definitions of the memory model parameters
Term | Definition
R_i > 0 | R_i is the number of memory frames required by process i, where i = 1, 2, …, n.
R > 0 | R is the total number of memory frames required by all processes (R = R_1 + R_2 + … + R_n).
M_a | M_a is the total number of free memory frames available in the system.
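The slowdown quoted above for a page-fault rate of one in 1,000 can be reproduced with the standard effective access time formula, EAT = (1 − p) × memory access time + p × page-fault service time. The sketch below assumes a 200 ns memory access and an 8 ms page-fault service time; these are illustrative values chosen for the example, not measurements from this study, and the function names are hypothetical.

```python
def effective_access_time_ns(fault_rate: float,
                             memory_ns: float = 200.0,
                             fault_service_ns: float = 8_000_000.0) -> float:
    """Effective access time for a given page-fault rate p (0 <= p <= 1).

    memory_ns and fault_service_ns are assumed example timings.
    """
    if not 0.0 <= fault_rate <= 1.0:
        raise ValueError("fault_rate must be between 0 and 1")
    return (1.0 - fault_rate) * memory_ns + fault_rate * fault_service_ns


def slowdown_factor(fault_rate: float,
                    memory_ns: float = 200.0,
                    fault_service_ns: float = 8_000_000.0) -> float:
    """Ratio of the effective access time to the fault-free access time."""
    eat = effective_access_time_ns(fault_rate, memory_ns, fault_service_ns)
    return eat / memory_ns


if __name__ == "__main__":
    print(effective_access_time_ns(0.001))  # effective access time in ns
    print(slowdown_factor(0.001))           # slowdown relative to 200 ns
```

With these assumed timings, a fault rate of 0.001 gives an effective access time of about 8.2 microseconds, roughly 40 times the fault-free access time.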
Bounds for memory availability
An analytical framework is developed here based on the availability of primary memory for predicting the efficiency associated with concurrent thread execution in multi-core machines. The primary contribution of this section is the derivation of upper-and lower-bound formulas for the memory model. 
The ideal thread execution time (τ) represents the best possible execution environment, in which threads receive the required amount of primary memory in all situations and the cache memory hit ratio is 100%. That is, for each page, there is only one look-up in the page table and a single access to primary memory, and the data are available in primary memory. Thus, the ideal memory access time together with the data process time, ρ, represents the ideal thread execution time.
Figure 8 Upper-bound efficiency curves associated with equation (14) for various ideal thread execution times τ in sec (see online version for colours)

Figure 8 illustrates a collection of plots for the theoretically derived upper-bound model (m) for different ideal thread execution times (τ). In this figure, the horizontal axis represents the memory availability percentage (the percentage of available primary memory with respect to the amount of memory required by threads) and the vertical axis represents the efficiency of thread execution associated with equation (13). It can be observed from Figure 8 that when the memory availability of the machine decreases, the efficiency of the machine for process execution decreases because of increased activity by the pager (handling page faults). It can also be observed that when the value of τ is small (i.e., the process runs for a short period of time), the efficiency decrease is more dramatic than for large values of τ. For example, if two processes require the same amount of memory but process one (P_1) runs for a shorter period of time than process two (P_2), then, because the page fault service time overhead is the same for both, P_1 suffers relatively more efficiency degradation than P_2.

The lower-bound model represents the worst-case memory access time efficiency. It additionally includes a primary memory access time caused by cache misses, the backing store seek time for individual pages (virtual memory access by multiple concurrent threads results in pages being stored in non-consecutive order), and the I/O queuing delay. The lower-bound model for m, which additionally includes the backing store overhead expressed in equation (16) and the cache-miss overhead, can be represented as:
The backing store overhead, denoted by η, can be expressed as:
The backing store overhead (η) increases with the number of page swaps (in/out) and the page seek time. The difference between the upper- and lower-bound models might be significant when page faults first start, owing to the uncertainty of the backing store seek time, latency, command queue delay, and other factors. The number of disk commands waiting in the queue is normally the factor that slows down disk performance by increasing the average disk queue time (Tanaka, 2005).
Experimental studies
Overview
The purpose of these experimental studies is to measure the efficiency of thread execution as a function of memory availability for collections of threads in an actual dynamic environment. For the study, two sets of programs have been developed. Algorithm 2 presents the high-level pseudo-code for the first set of programs. The primary role of the first set of programs is to generate threads that allocate and initialise primary memory according to the memory availability percentage and hold the memory until all spawned benchmark threads terminate. The second set of programs consists of the memory benchmark programs presented in Table 1. The binary search tree, dense matrix multiplication, and image rendering programs are selected as benchmark programs because they include memory-intensive expressions and require a reasonable amount of free memory for processing data. Required data are generated randomly and assigned to a one- or two-dimensional array depending on the benchmark program. Algorithm 3 presents high-level pseudo-code for creating and spawning the memory benchmark programs.
The Linux kernel creates a new virtual address space for each child thread because threads are created by using the fork() system call. The kernel creates a complete copy of the existing process's virtual address space; it then copies the parent process's vm_area_struct descriptors and creates a new set of page tables for the child (Avi et al., 2010; Tanenbaum, 2008). The parent's page table and references are copied directly; thus the parent and child share the same physical pages in their address spaces for all test cases.
Experimental environment
The same experimental environment described in Section 3.2.2 is used for the memory studies. As the purpose of these tests is to precisely measure the effect of memory availability on thread execution, the number of concurrent threads spawned by the benchmark programs was always less than or equal to the number of CPU cores, to prevent any overhead due to context switching or other CPU-related inefficiencies.
Algorithm 3 Creating and spawning child threads for measuring thread execution efficiency of memory benchmark programs
Input: Number of threads (n), 2D array size (s), pipe descriptor (fd), and linear search key.
Wait for the response of the consume-memory operation
Generate a child process to govern the next set of child threads via pipe communication
Benchmarking
Each binSearchTree, denseMatMul, and iRender benchmark thread consists of a set of operations that mix CPU- and memory-related expressions, which makes them well suited to the empirical environment of the memory model. For setting up the benchmark, each thread is assigned an independent two-dimensional array of 250 MB. For each array, memory for the rows is allocated first, followed by memory for the columns of each row. After memory allocation, each memory frame is initialised with a random value ranging from 1 to 10,000 to be searched. This initialisation also helps to avoid fake memory allocation (due to demand paging). The system function, meminfo, is invoked to validate the accuracy of memory allocation for the respective threads. For the iRender benchmark program, several large bitmap files are rendered using the image smoothing algorithm and the output file is stored on the secondary disk. For benchmark runs, the available primary memory of the system was above the memory required by the threads, to ensure 100% availability of primary memory for completing the assigned work. This approach ensures that there is no additional overhead due to page fault service. Constants and literals in expressions (for generating work load) are avoided to eliminate caching effects. The machine's efficiency data are stored in multiple CSV files for benchmarking and statistical analysis.
Empirical memory availability case studies
A wide range of empirical studies were conducted to validate the proposed memory availability model of equations (13) and (15). For estimating the effect of memory availability on concurrent thread execution, four major independent case studies were conducted in which one, two, three, and four threads were spawned concurrently on a quad-core machine. Each case study comprised 2,000 independent test runs for measuring the thread execution time, where memory availability was varied from 120% down to 0% with respect to the memory requirement of the threads.
Figures 9, 10, and 11 show measured thread execution efficiency scatter graphs for one, two, and four concurrent threads, superimposed with plots of the theoretically derived upper- and lower-bound models associated with equations (13) and (15). In these figures, the horizontal axis represents the memory availability percentage (the system's memory availability with respect to the running threads) and the vertical axis represents the efficiency of thread execution (the ideal thread execution time divided by the test run execution time) on a scale of 0.0 to 1.0. Each small dot in these graphs is an independent test case measurement of thread execution efficiency. This empirical study on a quad-core machine shows that as page swap-out starts because of a lack of available primary memory, the thread execution efficiency falls sharply because of the page fault service time. The variation in thread execution efficiency is greatest, and the efficiency therefore least predictable, when the availability is in the range of 100% to 80%. The reasons for the wide variation are the initial backing store latency, command queue delay, and serial I/O processing. From the figures, it is apparent that the theoretical upper- and lower-bounds introduced in this paper bound the actual measured values of thread execution efficiency. Figure 11 shows a smaller bound difference than the one- and two-thread cases. The variation in execution efficiency reduces when the number of threads in a batch increases. It can also be observed that when the memory availability is above 100%, the bound difference increases as the number of threads increases. As the number of threads increases, the probability of page swap-out and queuing delay increases. The model-based bounds are tighter when the memory availability for threads is either high or low.
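The two axes of these scatter graphs follow directly from the definitions given above: memory availability is the available memory expressed as a percentage of the memory required by the threads, and efficiency is the ideal thread execution time divided by the measured execution time. A minimal Python sketch, with hypothetical function names and made-up sample values, is:

```python
def memory_availability_pct(free_frames: int, required_frames: int) -> float:
    """Available memory as a percentage of the memory required by the threads."""
    if required_frames <= 0:
        raise ValueError("required_frames must be positive")
    return 100.0 * free_frames / required_frames


def execution_efficiency(ideal_seconds: float, measured_seconds: float) -> float:
    """Ideal thread execution time divided by the measured execution time."""
    if measured_seconds <= 0:
        raise ValueError("measured_seconds must be positive")
    return ideal_seconds / measured_seconds


if __name__ == "__main__":
    # One made-up test run: 980 free frames for 1,000 required frames,
    # and a run that took twice the ideal time.
    print(memory_availability_pct(980, 1000))
    print(execution_efficiency(30.0, 60.0))
```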
In further reporting the results of the studies of one, two, three, and four concurrent threads on a quad-core machine, Table 6 presents the values of the theoretically derived upper-bound model (UB), the lower-bound model (LB), the difference between the upper- and lower-bounds (BD), and the average measured thread execution efficiency (AVG) while the memory availability was varied from 0% to 120%. From the table, it can be observed that the difference between the upper- and lower-bounds can reach as high as 0.688 when n = 4 and MAP = 98 (i.e., when page faults have just started). However, the measured average thread execution efficiency for this same case is much smaller, around 0.379. Bound differences are low for the four-thread case when the memory availability is relatively high or low.
Outcome
Analytical models (and empirical studies) were developed for predicting (and measuring) the execution efficiency of concurrent threads in a Linux environment for multi-core architectures as a function of memory availability. As would be expected, degradation in thread execution efficiency occurs when primary memory availability is less than the total memory required by the threads. In addition to memory availability, the total amount of memory required by concurrent threads is a factor in predicting the thread execution efficiency. When the total memory requirement exceeds the total available memory, the impact on execution time can be significant. Specifically, increased page swap-out by concurrent threads and long queue delays due to serial I/O degrade thread execution efficiency. A memory efficiency scatter plot provides a clear visualisation of measured efficiency based on the density of the dots in the plot. These empirically measured values are shown to generally fall within the theoretically derived upper- and lower-bound models.
Composite CPU and memory model
The CPU availability and memory models from Sections 3 and 4, respectively, are used in this section to derive a composite CPU/memory prediction model for estimating the efficiency of thread execution in multi-core systems. As new processes are assigned or existing processes complete execution, the CPU and memory availability of a given machine can change significantly in a short span of time. The composite prediction model is important because a wide range of applications and scientific models (e.g., geological, meteorological, economic and others) make extensive and repeated use of both CPU and memory resources. Distributed schedulers that assign a set of batch tasks in a distributed environment can utilise the composite model to determine the order (or find subsets) in which the tasks should be assigned to compute nodes, prior to their placement in the run queue, so that the completion time for all tasks is minimised (Hasan et al., 2013; Garay et al., 2013).
Bounds of composite CPU and memory model
A composite analytical framework consisting of upper- and lower-bounds is derived for estimating the overall efficiency for a batch of tasks in multi-core systems. Input parameters for the composite upper- and lower-bound models include the CPU availability upper- and lower-bound models derived in equations (3) and (7), and the memory efficiency upper- and lower-bound models derived in equations (13) and (15), respectively. The composite CPU and memory upper-bound model, denoted cm, represents the best-case efficiency of a machine for concurrent thread execution. It can be represented by the product of the two models because CPU and memory are the two primary factors used to characterise machines.
The values of c and m represent the relative impact on a machine's overall efficiency of the loading of the machine's CPU and memory resources, respectively. The composite upper-bound model consists of two scenarios. In the first scenario, when M_a is greater than or equal to R, the execution efficiency value depends on the value of aggregate CPU loading. In the second scenario, when M_a is less than R, the execution efficiency value depends on the composite effect of CPU and memory.
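The two scenarios of the composite upper bound can be sketched as follows, treating the CPU factor c and the memory factor m as values already obtained from their respective upper-bound models. This is a minimal Python sketch under that assumption; the function and parameter names are hypothetical, and the underlying equations (3) and (13) are not reproduced here.

```python
def composite_upper_bound(cpu_factor: float, mem_factor: float,
                          free_frames: int, required_frames: int) -> float:
    """Composite CPU/memory upper-bound efficiency.

    cpu_factor: value c of the CPU availability upper-bound model.
    mem_factor: value m of the memory efficiency upper-bound model.
    free_frames / required_frames: M_a and R from Table 5.
    """
    for name, value in (("cpu_factor", cpu_factor), ("mem_factor", mem_factor)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    if free_frames >= required_frames:
        # Scenario 1 (M_a >= R): efficiency depends on CPU loading only.
        return cpu_factor
    # Scenario 2 (M_a < R): composite effect, the product of both factors.
    return cpu_factor * mem_factor


if __name__ == "__main__":
    print(composite_upper_bound(0.5, 0.8, free_frames=100, required_frames=100))
    print(composite_upper_bound(0.5, 0.8, free_frames=50, required_frames=100))
```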
The composite CPU and memory availability lower-bound model, denoted cm, represents the worst-case efficiency of the machine for thread execution and is defined by the product of the CPU and memory lower bounds.
A new parameter, μ, is incorporated to represent the probability of worst-case staggering of work-phases. When the value of L is less than or equal to r, the value of μ is L/(n × ξ) because of the small probability of worst-case staggering. When the value of L is greater than r, the value of μ becomes L/(n + r × ξ) because of increased work-phase overlap probability.
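The piecewise definition of μ given above translates directly into code. The following Python sketch uses the symbols from the text (L, n, r, ξ); the function name and the input checks are additions made for the example.

```python
def worst_case_staggering_probability(aggregate_load: float, threads: int,
                                      cores: int, hyper_threads: int) -> float:
    """Parameter mu: probability of worst-case staggering of work phases.

    L <= r: mu = L / (n * xi)
    L >  r: mu = L / (n + r * xi)
    """
    if threads <= 0 or cores <= 0 or hyper_threads <= 0:
        raise ValueError("threads, cores and hyper_threads must be positive")
    if aggregate_load < 0:
        raise ValueError("aggregate_load must be non-negative")
    if aggregate_load <= cores:
        return aggregate_load / (threads * hyper_threads)
    return aggregate_load / (threads + cores * hyper_threads)


if __name__ == "__main__":
    # Eight threads on a quad-core machine; xi = 2 is an example value.
    print(worst_case_staggering_probability(2.0, threads=8, cores=4, hyper_threads=2))
    print(worst_case_staggering_probability(8.0, threads=8, cores=4, hyper_threads=2))
```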
The lower bound of the composite prediction model consists of two scenarios. In the first scenario, when M_a is greater than or equal to R, the execution efficiency value is derived from the product of the CPU availability lower-bound model and the first expression of the memory lower-bound model. In the second scenario, when M_a is less than R, the execution efficiency value is derived from the product of the CPU availability lower-bound model and the second expression of the memory lower-bound model, to reflect the composite effect accurately.
Figure 12(a) shows surfaces of the theoretically derived upper- and lower-bound models, in which the horizontal axes represent aggregate CPU loading (of the set of tasks) and memory availability percentage (with respect to the memory requirements of the set of tasks), and the vertical axis represents the overall efficiency of the machine. It can be observed that for CPU loading values greater than the number of cores (here, r = 4), the idealised function for c decreases according to the ratio of the number of cores to the total CPU loading. A significant degradation of efficiency can also be observed when the memory availability for threads decreases from 120% down to 0%. Ideally, if a machine's CPU and memory resources are both lightly loaded, then the efficiency of the machine is at or near its maximum value. Figures 12(b) and (c) show the effect of changing the value of the ideal thread execution time (τ) in the upper-bound model from 1 to 100 sec. It can be observed among the surfaces that when the value of τ increases from 1 to 100, the overall efficiency of the node increases, given that the memory requirement is the same in all three cases. The efficiency value increases because when the program runs for a longer period of time, the page fault overhead (which is the same in all three cases) has less impact than for a program that runs for a shorter period of time. A similar effect is illustrated in Figure 8 of Section 4.1.
In the following section, experimental studies are performed to measure processor efficiency while executing threads, to verify that the introduced upper- and lower-bound composite models can estimate the actual empirically measured efficiency surface.
Experimental studies
Overview
The benchmark programs used for measuring the overall efficiency of a machine are presented in Table 1. The Bitonic Sorting Network, Sparse matrix vector multiplication, and Tridiagonal Solver (using Gaussian elimination) programs consist of expressions that mix CPU- and memory-related operations and are thus well suited to the composite model empirical studies. As described in Section 3.2.1, threads are modelled by a series of alternating work and sleep phases. The purpose of the experimental study is to empirically measure the efficiency of the machine as a function of aggregate CPU loading and memory availability factors. Aggregate CPU loading and memory requirement values are selected randomly and distributed among threads by using Algorithms 1, 2, and 3, respectively. Uniform sampling of data across the values of possible aggregate CPU loading and memory availability has been carried out. For implementing the benchmark and framework programs, most of the modules of the CPU and memory availability programs are reused.
About 18,000 test runs are conducted for three independent sets of test cases in which eight, 12, and 16 threads are spawned concurrently in a quad-core machine. This vast number of test runs is conducted to sufficiently cover the possible scenarios of thread executions. A similar experimental environment as described in Section 3.2.2 has also been deployed here for conducting the case studies.
Benchmarking
A similar but combined approach (described in Sections 3.2.3 and 4.2.3) was used for setting up the empirical framework programs of the composite model. For the memory portion, each thread was assigned an array of 50 MB, and the data used to initialise the array were generated randomly. For the CPU portion, sleep phase length and total work load values are calculated by utilising equations (8) and (9), respectively. The thread with the minimum CPU load was assigned 1.0 × 10^8 units of work load, which should be accomplished in 25 work-sleep phases. In the work component of each phase, threads accomplish 4.0 × 10^6 units of fixed computational work. Thus, when the memory requirement of the threads is less than the available memory of the system, equation (19) can be used to determine the ideal thread execution time, denoted by ν. Input parameters for computing the ideal execution time (ν) are the work phase time in the quantum (Q_w), the number of work phases needed to complete the total work (N_w), and the CPU usage factor of the thread (L_i).
According to equation (19), the value of ν can be measured dynamically using the work phase length (measured by multiple test runs), the number of work phases (total work divided by the work accomplished in a single phase), and the CPU usage factor of the thread. All parameter values of equation (19) are available while the program is running. As the total amount of work depends on the CPU load of the threads, the ideal thread execution time, which varies with the work load, can be measured accurately by utilising equation (19).
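The arithmetic behind these parameters can be sketched in Python. The number of work phases follows directly from the text (total work divided by the work accomplished per phase). The form used for ν is an assumption: equation (19) is not reproduced in this section, so the sketch supposes that ν equals the total work time (N_w × Q_w) divided by the CPU usage factor L_i, on the grounds that L_i is the fraction of time a thread requires the CPU. The function names are hypothetical.

```python
def number_of_work_phases(total_work_units: float,
                          work_units_per_phase: float) -> int:
    """N_w: total work divided by the work accomplished in a single phase."""
    if work_units_per_phase <= 0:
        raise ValueError("work_units_per_phase must be positive")
    return round(total_work_units / work_units_per_phase)


def ideal_execution_time(work_phase_seconds: float, work_phases: int,
                         cpu_usage_factor: float) -> float:
    """Assumed form of nu: (N_w * Q_w) / L_i.

    This is an illustrative guess at equation (19), not a quotation of it.
    """
    if not 0.0 < cpu_usage_factor <= 1.0:
        raise ValueError("cpu_usage_factor must lie in (0, 1]")
    return work_phases * work_phase_seconds / cpu_usage_factor


if __name__ == "__main__":
    n_w = number_of_work_phases(1.0e8, 4.0e6)
    print(n_w)
    # Q_w = 0.5 s and L_i = 0.5 are made-up example values.
    print(ideal_execution_time(0.5, n_w, 0.5))
```

With the figures quoted above (1.0 × 10^8 units of total work and 4.0 × 10^6 units per phase), number_of_work_phases returns 25, which matches the 25 work-sleep phases stated in the text.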
Case studies and results
For measuring the overall efficiency value of thread execution, three independent case studies were conducted on a quad-core machine for eight, 12, and 16 threads. Aggregate CPU load and memory availability values were selected randomly and distributed among the threads as described in Sections 3.2.1 and 4.2.1, respectively. As the focus of this empirical study is to measure the effect in overall efficiency of machines when both CPU and memory usage are varied, the number of threads deployed is always above the number of CPU cores of the machine and the memory availability is varied from 120% down to 0%. 
Figure 13(a) shows the measured efficiency surface for the execution of eight threads on a quad-core machine. About 4,000 independent test cases were carried out to capture the range of execution efficiency scenarios arising from CPU and memory availability variation. A moving average is taken from these test results with a sliding window size of 0.10 aggregate CPU loading and 0.5% memory availability, and an increment of 0.01 CPU loading and 0.5% memory availability. The data are then converted to a two-dimensional matrix format for plotting a 3D surface diagram. It can be observed from Figure 13(a) that the efficiency value decreases significantly when the aggregate CPU load exceeds the number of CPU cores (here, r = 4) because of the increased CPU contention among running threads. Figure 13(a) also illustrates the decrease in machine efficiency caused by the decrease in memory availability. The efficiency decrease due to memory availability is moderate because the τ value of the test runs is large (refer to Figure 8; here, τ = 30 sec).
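The conversion of scattered test results into a matrix for surface plotting can be sketched as a two-dimensional sliding-window average. The Python code below is an illustration only: the function name, the half-open window convention, and the use of None for empty cells are assumptions, and the default window and step sizes are the values quoted above.

```python
def efficiency_surface(samples, load_range, mem_range,
                       load_win=0.10, load_step=0.01,
                       mem_win=0.5, mem_step=0.5):
    """Average efficiency over a sliding 2D window.

    samples: iterable of (cpu_load, mem_availability_pct, efficiency).
    Returns a matrix (list of rows); rows follow memory availability and
    columns follow CPU load. Cells with no samples hold None.
    """
    n_load = int(round((load_range[1] - load_range[0] - load_win) / load_step)) + 1
    n_mem = int(round((mem_range[1] - mem_range[0] - mem_win) / mem_step)) + 1
    surface = []
    for i in range(n_mem):
        m_lo = mem_range[0] + i * mem_step
        row = []
        for j in range(n_load):
            l_lo = load_range[0] + j * load_step
            cell = [e for load, mem, e in samples
                    if l_lo <= load < l_lo + load_win
                    and m_lo <= mem < m_lo + mem_win]
            row.append(sum(cell) / len(cell) if cell else None)
        surface.append(row)
    return surface


if __name__ == "__main__":
    demo = [(0.02, 100.1, 1.0), (0.06, 100.2, 0.5)]  # made-up measurements
    print(efficiency_surface(demo, load_range=(0.0, 0.2), mem_range=(100.0, 101.0),
                             load_win=0.10, load_step=0.05))
```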
In Figures 13(b) and 13(c), the efficiency surface is superimposed with upper- and lower-bound surfaces associated with equations (17) and (18), respectively. The purpose of Figures 13(b) and 13(c) is to verify that the efficiency surface shown in Figure 13(a) is bounded by the composite upper- and lower-bound models. From the empirical results and the measured efficiency surface plot in Figure 13(b), it is apparent that the theoretically derived upper- and lower-limits introduced in this section do bound the actual measured efficiency surface very well (note: the efficiency surface diagram is given a different and light colour so that any overlap can be detected immediately). It can also be seen from Figure 13(b) that the bounds are tighter when the aggregate CPU and memory loading is either relatively low or relatively high. Figure 13(c) shows the same surface diagrams from a different perspective to illustrate that there are no crossings among the measured efficiency, upper-bound, and lower-bound surfaces. This figure also illustrates the degradation in efficiency when the memory availability decreases.
In the next set of test cases, 16 threads were spawned concurrently for measuring the overall machine efficiency of thread execution. The same approach as for eight threads was taken for conducting the empirical case studies. About 8,000 independent test cases were conducted in which CPU and memory availability were selected randomly and distributed among threads to cover the range of possible scenarios. Moving averages are taken from the resulting data with a similar sliding window approach for generating efficiency surface data. Upper- and lower-bound surface data are generated from the same model equations (17) and (18), respectively. Figure 14(a) shows the average measured efficiency surface for 16 threads in a batch. A similar decrease in efficiency has also been observed when CPU loading increases and memory availability decreases. Figures 14(b) and 14(c) illustrate the efficiency surface superimposed with the theoretically derived upper- and lower-bound surfaces. From Figures 14(b) and 14(c), it is apparent that the upper- and lower-bound prediction models introduced here can also bound the actual measured efficiency surface very well.
Experimental studies and statistical analysis of eight, 12, and 16 concurrent threads, conducted independently, validate the accuracy of the introduced prediction models. The proposed composite prediction model provides a basis for an empirical model for estimating the execution efficiency of processes while CPU and memory resources are uncertain.
Conclusions
This paper has developed analytical models (and conducted empirical studies) for predicting (and measuring) the efficiency of multi-core machines for a set of tasks. It is necessary to have accurate models for estimating CPU and memory resources because of the dynamic nature of computer systems and their workload. The composite prediction model proposed in this paper is derived from the CPU availability and memory models. The CPU availability and memory models have been introduced for predicting the overall efficiency of machines for concurrent thread execution on a time-shared system. The prediction models were validated empirically by an extensive set of case studies involving real-world benchmark programs with dynamic behaviour. Additionally, an average CPU availability prediction model is introduced and validated empirically. The introduced average CPU availability model is suitable for multi-core systems with hyper-threading enabled.
The results of the empirical studies are presented in this paper in the form of scatter plots, surfaces, and tables. Scatter plots provide a clear visualisation of measured efficiency based on the density of the dots in the plot for the CPU availability and memory models. For the composite prediction model, surface plots of measured thread execution efficiency along with upper- and lower-bound surfaces provide a clear visualisation of the combined CPU and memory availability variation in thread execution. These empirically measured availability values are shown to generally fall within the theoretically derived upper- and lower-bound models.
The introduced composite prediction model can be used as a building block for a distributed task scheduler to determine the order (or find subsets) in which tasks should be assigned to compute nodes, prior to their placement in the run queue, for minimising the total execution time. The obtained results support the strength of the introduced models for predicting the efficiency of a machine while executing threads. Hence, the ability of these models to predict the CPU availability and processor efficiency for thread execution while the availability of resources (CPU and memory) is uncertain in a dynamic environment has been demonstrated. Thus, the usefulness of the introduced prediction models in real-world applications for estimating the execution efficiency of tasks before they are placed into the run queue by a scheduler has been motivated, and is the topic of future studies.
